Grammatical and topical gender in crosslinguistic word embeddings Kate McCurdy Berlin NLP June 14 2017


Page 1: June 14 2017 Berlin NLP crosslinguistic Kate McCurdy word ...anacode.de/.../06/Kate-McCurdy-Grammatical-gender... · German - grammatical gender Spanish - grammatical gender Dutch

Grammatical and topical gender in crosslinguistic word embeddings
Kate McCurdy
Berlin NLP
June 14 2017

Word embeddings: From (almost) scratch to NLP
● Goal: word representations that...
○ capture maximal semantic/syntactic information, yet
○ require minimal task-specific feature engineering
● Neural embeddings to the rescue!
○ Input: barely processed, massive corpora
■ In general: tokenization + trimming the long tail in vocab
■ Collobert et al.: capitalization as feature + a few extra tweaks
■ Mikolov et al.: n-gram phrase identification
○ Output: dense, magically performant vectors
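
The "barely processed" input step above can be sketched in a few lines. This is only an illustration under assumptions: the regex tokenizer, the `min_count` threshold, and the `<unk>` placeholder are hypothetical choices, not the preprocessing used in the talk or in the cited papers. Dropping rare tokens instead of replacing them is an equally common option.

```python
import re
from collections import Counter

def tokenize(line):
    # Crude tokenizer (assumption): lowercase, keep runs of word characters.
    return re.findall(r"\w+", line.lower())

def build_corpus(lines, min_count=2, unk="<unk>"):
    """Tokenize each line and trim the long tail of the vocabulary.

    Tokens seen fewer than `min_count` times are replaced by `unk`.
    """
    sentences = [tokenize(line) for line in lines]
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    trimmed = [[tok if tok in vocab else unk for tok in sent] for sent in sentences]
    return trimmed, vocab

# Toy input, made up for illustration.
lines = ["The cat sat.", "The dog sat.", "A bird flew."]
sentences, vocab = build_corpus(lines, min_count=2)
print(sorted(vocab))
print(sentences)
```

The trimmed sentences are what would then be fed to an embedding trainer.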

… but there are pitfalls

You shall know a word by the company it keeps.

Firth 1957

Pitfall #1
What if your words keep company with some unsavory stereotypes?

Analogous relations in the GloVe word embedding; from Caliskan-Islam et al 2016

Stereotypes in word embeddings: Bolukbasi et al. 2016
addiction : eating disorder
accountant : paralegal
pilot : flight attendant
athlete : gymnast
professor emeritus : associate professor

Bias in humans: the Implicit Association Test
● Standard psychological test to assess implicit bias
● Design:
○ Two sets of attribute words
■ Male, man, boy, …
■ Female, woman, …
○ Two sets of target words
■ Children, wedding, …
■ Office, salary, …
○ Task: fast left vs right categorization of both sets
● Measurement: differential association in average response time
Greenwald et al. 1998

● WEAT: the Word Embedding Association Test
● Parallels the Implicit Association Test
● Measures the differential association between paired target and attribute word sets via cosine similarity
● Core finding: nearly every single prejudice uncovered by the IAT is replicated by the WEAT on Google News (word2vec) and GloVe word embeddings
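
A minimal sketch of the WEAT effect size, following the definition in Caliskan et al. (2017): each target word gets an association score (mean cosine similarity to one attribute set minus mean cosine similarity to the other), and the effect size is the difference of the two target sets' mean scores divided by the standard deviation of the scores over all target words. The 2-D vectors below are made up for illustration, and using the sample standard deviation (`statistics.stdev`) is an assumption.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, A, B, emb):
    # s(w, A, B): mean similarity of w to attribute set A minus that to set B.
    return (mean(cosine(emb[w], emb[a]) for a in A)
            - mean(cosine(emb[w], emb[b]) for b in B))

def weat_effect_size(X, Y, A, B, emb):
    # Difference of mean association scores of target sets X and Y,
    # normalized by the spread of scores over all target words.
    sx = [association(x, A, B, emb) for x in X]
    sy = [association(y, A, B, emb) for y in Y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)

# Toy 2-D "embedding" with invented values, not measured data.
emb = {
    "male": (1.0, 0.1), "man": (0.9, 0.2),
    "female": (0.1, 1.0), "woman": (0.2, 0.9),
    "office": (0.8, 0.3), "salary": (0.9, 0.1),
    "children": (0.3, 0.8), "wedding": (0.1, 0.9),
}
X, Y = ["office", "salary"], ["children", "wedding"]  # target sets
A, B = ["male", "man"], ["female", "woman"]           # attribute sets

effect = weat_effect_size(X, Y, A, B, emb)
print(effect)
```

A positive value means the first target set sits closer to the first attribute set than the second target set does. Swapping the two target sets flips the sign.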

Pitfall #1
What if your words keep company with some unsavory stereotypes?

Pitfall #2
What if your content words hang out with your function words and make weird artefacts?

Work with Oguz Serbetci (not pictured)

Crosslinguistic word embeddings

Data
● Corpus: OpenSubtitles
● ~5.5K movies with subtitles in 4 languages (2.6-2.9m ws):
○ German - grammatical gender
○ Spanish - grammatical gender
○ Dutch - grammatical gender orthogonal to “natural” gender
○ English - “natural” gender
● Lemmatized each corpus to remove gender
● Trained 10 word2vec CBOW embeddings per condition:
○ Language (4) x
○ Corpus version (2 - unprocessed vs lemmatized)
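
The design above amounts to a small grid of training runs, which can be enumerated as follows. This is a sketch only: the language codes, the dictionary keys, and the idea of giving each of the 10 runs its own random seed are assumptions, since the slides do not say how the repeated runs differed.

```python
from itertools import product

LANGUAGES = ["de", "es", "nl", "en"]             # German, Spanish, Dutch, English
CORPUS_VERSIONS = ["unprocessed", "lemmatized"]  # the two corpus versions
RUNS_PER_CONDITION = 10

def training_runs():
    # One configuration per (language, corpus version, repetition).
    for lang, version, run in product(LANGUAGES, CORPUS_VERSIONS,
                                      range(RUNS_PER_CONDITION)):
        yield {
            "language": lang,
            "corpus_version": version,
            "architecture": "cbow",  # word2vec CBOW, as stated on the slide
            "seed": run,             # assumption: runs differ by random seed
        }

runs = list(training_runs())
print(len(runs))  # 4 languages x 2 corpus versions x 10 runs = 80
```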

Method
● Measurement:
○ differential association using the Word Embedding Association Test (WEAT - Caliskan et al.)
{male} {female}
{career} {family}

Method
● Measurement:
○ differential association using the Word Embedding Association Test (WEAT - Caliskan et al.)
● Comparisons:
○ “Topical” semantic gender bias
■ replicate IAT findings of Caliskan et al. on dimension male:career::female:family

Method
● Measurement:
○ differential association using the Word Embedding Association Test (WEAT - Caliskan et al.)
● Comparisons:
○ “Topical” semantic gender bias
■ replicate IAT findings of Caliskan et al. on dimension male:career::female:family
○ Grammatical gender bias
■ use stimuli from Phillips & Boroditsky on dimension male:masculine::female:feminine
■ e.g. Spanish el sol (m), German die Sonne (f)

Topical gender bias

≈ average increase in cosine similarity per word

Topical gender bias Grammatical gender bias

Pitfall #2
What if your content words hang out with your function words and make weird artefacts?

Words can keep strange company!
And arbitrary properties like grammatical gender can distort your embeddings.

Thanks! Questions?

References

Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Quantifying and reducing stereotypes in word embeddings. arXiv Preprint arXiv:1606.06121.

Caliskan-Islam, A., Bryson, J. J., & Narayanan, A. (2016). Semantics derived automatically from language corpora necessarily contain human biases. arXiv Preprint arXiv:1608.07187.

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493–2537.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis (pp. 1–32). Oxford: Blackwell.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology, 74(6), 1464.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

Appendix

Interaction between topical and grammatical gender effects in DE + ES

Stereotypes in word embeddings: Bolukbasi et al. 2016
1. Define gender subspace

Stereotypes in word embeddings: Bolukbasi et al. 2016
1. Define gender subspace
2. Project profession names onto subspace

Stereotypes in word embeddings: Bolukbasi et al. 2016
1. Define gender subspace
2. Project profession names onto subspace
3. Generate analogies & get stereotype ratings from MTurk
addiction : eating disorder
accountant : paralegal
pilot : flight attendant
athlete : gymnast
professor emeritus : associate professor

Stereotypes in word embeddings: Bolukbasi et al. 2016
1. Define gender subspace
2. Project profession names onto subspace
3. Generate analogies & get stereotype ratings from MTurk
4. Compute transformation matrix to debias designated words
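
Steps 1, 2 and 4 can be illustrated with a deliberately simplified sketch. It is not the procedure from Bolukbasi et al.: the gender "subspace" is reduced to a single direction averaged from a few word-pair differences, and the learned transformation matrix is replaced by plainly removing each word's component along that direction. All vectors are invented toy values, and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def gender_direction(emb, pairs):
    # Step 1 (simplified): average the normalized difference vectors
    # of definitional pairs such as (he, she).
    diffs = [normalize([a - b for a, b in zip(emb[m], emb[f])]) for m, f in pairs]
    avg = [sum(col) / len(diffs) for col in zip(*diffs)]
    return normalize(avg)

def projection(v, g):
    # Step 2: signed component of a word vector along the gender direction.
    return dot(v, g)

def neutralize(v, g):
    # Step 4 (stand-in): subtract the component along g so that the
    # result is orthogonal to the gender direction.
    p = dot(v, g)
    return [a - p * b for a, b in zip(v, g)]

# Toy 3-D vectors; the values are made up and carry no empirical meaning.
emb = {
    "he": [1.0, 0.0, 0.2], "she": [-1.0, 0.0, 0.2],
    "man": [0.9, 0.1, 0.1], "woman": [-0.9, 0.1, 0.1],
    "pilot": [0.4, 0.8, 0.3], "paralegal": [-0.5, 0.7, 0.2],
}
g = gender_direction(emb, [("he", "she"), ("man", "woman")])
for word in ("pilot", "paralegal"):
    before = projection(emb[word], g)
    after = projection(neutralize(emb[word], g), g)
    print(word, before, after)
```

After neutralizing, the projection onto the gender direction is zero up to floating-point error, while the remaining components of the vector are unchanged.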