TRANSCRIPT
Grammatical and topical gender in crosslinguistic word embeddings
Kate McCurdy
Berlin NLP
June 14, 2017
Word embeddings: From (almost) scratch to NLP
● Goal: word representations that...
○ capture maximal semantic/syntactic information, yet
○ require minimal task-specific feature engineering
● Neural embeddings to the rescue!
○ Input: barely processed, massive corpora
■ In general: tokenization + trimming the long tail of the vocabulary
■ Collobert et al.: capitalization as a feature + a few extra tweaks
■ Mikolov et al.: n-gram phrase identification
○ Output: dense, magically performant vectors
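The "barely processed" input pipeline above (tokenize, then trim the long tail) can be sketched with a simple frequency cutoff; the `min_count` threshold, the `<unk>` placeholder, and the toy corpus are illustrative choices, not the exact preprocessing used in any of the cited papers:

```python
from collections import Counter

def build_vocab(tokenized_corpus, min_count=2):
    """Keep only tokens that appear at least min_count times;
    everything in the long tail maps to an <unk> placeholder."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    return [[tok if tok in vocab else "<unk>" for tok in sent]
            for sent in tokenized_corpus]

corpus = [["the", "sun", "rises"], ["the", "moon", "rises"], ["a", "star", "falls"]]
print(build_vocab(corpus))
```

With this toy corpus only "the" and "rises" clear the threshold; every singleton collapses into `<unk>`, which is what keeps the embedding vocabulary manageable on massive corpora.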
… but there are pitfalls
You shall know a word by the company it keeps.
Firth 1957
Pitfall #1: What if your words keep company with some unsavory stereotypes?
Analogous relations in the GloVe word embedding; from Caliskan-Islam et al. 2016
Stereotypes in word embeddings: Bolukbasi et al. 2016
[Figure: words (addiction, accountant, pilot, athlete, professor emeritus, eating disorder, paralegal, flight attendant, gymnast, associate professor, ...) projected along a gender direction]
Bias in humans: the Implicit Association Test
● Standard psychological test to assess implicit bias
● Design:
○ Two sets of attribute words
■ Male, man, boy, …
■ Female, woman, …
○ Two sets of target words
■ Children, wedding, …
■ Office, salary, …
○ Task: fast left-vs-right categorization of both sets
● Measurement: differential association in average response time
Greenwald et al. 1998
WEAT: the Word Embedding Association Test
● Parallels the Implicit Association Test
● Measures the differential association between paired target and attribute word sets via cosine distance
● Core finding: nearly every bias uncovered by the IAT is replicated by the WEAT on Google News and GloVe word embeddings
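The WEAT effect size from Caliskan et al. can be sketched directly: each target word's association is its mean cosine similarity to attribute set A minus its mean similarity to attribute set B, and the effect size is the standardized difference of these associations between target sets X and Y. A minimal sketch, assuming word vectors are already loaded as NumPy arrays (the function and variable names are mine):

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: differential association of target word sets
    X, Y with attribute word sets A, B (lists of embedding vectors)."""
    def s(w):  # association of one word with A vs. B
        return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])
    sx = [s(x) for x in X]
    sy = [s(y) for y in Y]
    # standardize by the pooled std of associations over X and Y
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy)
```

A large positive value means X words (e.g. career terms) sit closer to A (e.g. male terms) than Y words do, mirroring the IAT's response-time asymmetry in embedding space.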
Pitfall #1: What if your words keep company with some unsavory stereotypes?
Pitfall #2: What if your content words hang out with your function words and make weird artefacts?
Work with Oguz Serbetci (not pictured)
Crosslinguistic word embeddings
Data
● Corpus: OpenSubtitles
● ~5.5K movies with subtitles in 4 languages (2.6–2.9M words each):
○ German - grammatical gender
○ Spanish - grammatical gender
○ Dutch - grammatical gender orthogonal to “natural” gender
○ English - “natural” gender
● Lemmatized each corpus to remove gender marking
● Trained 10 word2vec CBOW embeddings per condition:
○ Language (4) ×
○ Corpus version (2 - unprocessed vs. lemmatized)
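The experimental grid above amounts to 80 trained models; a sketch of the design (language codes and the `seeds` name are illustrative, not from the slides):

```python
from itertools import product

# Experimental grid from the slide: 4 languages x 2 corpus versions,
# with 10 word2vec CBOW runs per condition (e.g. varying random seeds).
languages = ["de", "es", "nl", "en"]   # German, Spanish, Dutch, English
versions = ["unprocessed", "lemmatized"]
seeds = range(10)

conditions = list(product(languages, versions, seeds))
print(len(conditions))  # 4 * 2 * 10 = 80 trained models
```

Training multiple embeddings per condition matters because word2vec runs are stochastic; averaging WEAT scores across runs separates real bias effects from initialization noise.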
Method
● Measurement:
○ differential association using the Word Embedding Association Test (WEAT - Caliskan et al.)
○ e.g. {male} vs. {female} attribute sets against {career} vs. {family} target sets
● Comparisons:
○ “Topical” semantic gender bias
■ replicate IAT findings of Caliskan et al. on the dimension male:career::female:family
○ Grammatical gender bias
■ use stimuli from Phillips & Boroditsky on the dimension male:masculine::female:feminine
■ e.g. Spanish el sol (m), German die Sonne (f)
Results
[Chart: Topical gender bias; ≈ average increase in cosine similarity per word]
[Chart: Topical gender bias vs. Grammatical gender bias]
Pitfall #2: What if your content words hang out with your function words and make weird artefacts?
Words can keep strange company! And arbitrary properties like grammatical gender can distort your embeddings.
Thanks! Questions?
References
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Quantifying and reducing stereotypes in word embeddings. arXiv preprint arXiv:1606.06121.
Caliskan-Islam, A., Bryson, J. J., & Narayanan, A. (2016). Semantics derived automatically from language corpora necessarily contain human biases. arXiv preprint arXiv:1608.07187.
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493–2537.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis (pp. 1–32). Oxford: Blackwell.
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74(6), 1464.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).
Appendix
Interaction between topical and grammatical gender effects in DE + ES
Stereotypes in word embeddings: Bolukbasi et al. 2016
1. Define gender subspace
2. Project profession names onto subspace
3. Generate analogies & get stereotype ratings from MTurk
4. Compute transformation matrix to debias designated words
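Steps 1–2 above can be sketched in a few lines. Note the simplification: Bolukbasi et al. estimate the gender subspace via PCA over the differences of definitional pairs, whereas this sketch uses the mean pair difference as a single gender direction; the toy 2-D vectors and function names are mine:

```python
import numpy as np

def gender_direction(pairs):
    """Step 1 (simplified): estimate a gender direction as the mean
    difference of definitional pairs, e.g. (she, he), (woman, man).
    Bolukbasi et al. use PCA over pair differences instead."""
    diffs = np.array([a - b for a, b in pairs])
    g = diffs.mean(axis=0)
    return g / np.linalg.norm(g)

def project(word_vec, g):
    """Step 2: scalar projection of a word vector onto the gender
    direction; the sign shows which pole the word leans toward."""
    return float(word_vec @ g)
```

Ranking profession vectors by this projection is what produces lists like the she/he extremes shown earlier; the debiasing step then removes (or equalizes) this component for designated words.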