An introduction to word embeddings

W4705: Natural Language Processing

Fei-Tzin Lee

September 23, 2019



Today

1 What are these word embedding things, anyway?

2 Distributional semantics

3 word2vec

4 Analogies with word embeddings


Representing knowledge

Humans have rich internal representations of words that let us do all sorts of intuitive operations, including (de)composition into other concepts.

• “parent’s sibling” = ‘aunt’ - ‘woman’ = ‘uncle’ - ‘man’

• The attribute of a banana that is ‘yellow’ is the same attribute of an apple that is ‘red’.

But this is obviously impossible for machines. There’s no numerical representation of words that encodes these sorts of abstract relationships.

...Right?


A bit of magic

Figure: Output from the gensim package using word2vec vectors pretrained on Google News.

• This is not a fancy language model

• No external knowledge base

• Just vector addition and subtraction with cosine similarity
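
For concreteness, output like the figure's can be produced with a few lines of Python. This is a minimal sketch, assuming the gensim downloader is available and that a query of the "king - man + woman" sort is what was shown; the exact query behind the figure is not specified.

```python
import gensim.downloader as api

# word2vec vectors pretrained on Google News (large one-time download).
wv = api.load('word2vec-google-news-300')

# vector('king') - vector('man') + vector('woman') ~= ?
# most_similar ranks candidate words by cosine similarity to the combined vector.
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
```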


A bit of magic? Math.

Where did these magical vectors come from? This trick works in a few different flavors:

• SVD-based vectors

• word2vec, from the example above, and other neural embeddings

• GloVe, something akin to a hybrid method


Word embeddings

The semantic representations that have become the de facto standard in NLP are word embeddings, vector representations that are

• Distributed: information is distributed throughout indices (rather than sparse)

• Distributional: information is derived from a word’s distribution in a corpus (how it occurs in text)

These can be viewed as an embedding from a discrete space of words into a continuous vector space.


Applications

• Language modeling

• Machine translation

• Sentiment analysis

• Summarization

• etc...

Basically, anything that uses neural nets can use word embeddings too, and some other things besides.


Words.

What is a word?

• A composition of characters or syllables?

• A pair - usage and meaning.

These are independent of representation. So we can choose what representation to use in our models.


So, how?

How do we represent the words in some segment of text in a machine-friendly manner?

• Bag-of-words? (no word order)

• Sequences of numerical indices? (relatively uninformative)

• One-hot vectors? (space-inefficient; curse of dimensionality)

• Scores from lexicons, or hand-engineered features? (expensive and not scalable)

Plus: none of these tell us how the word is used, or what it actually means.


What does a word mean?

Let’s try a hands-on exercise.

Obviously, word meaning is really hard for even humans to quantify. So how can we possibly generate representations of word meaning automatically?

We approach it obliquely, using what is known as the distributional hypothesis.


The distributional hypothesis

Borrowed from linguistics: the meaning of a word can be determined from the contexts it appears in.[1]

• Words with similar contexts have similar meanings (Harris, 1954)

• “You shall know a word by the company it keeps” (Firth, 1957)

Example: “My homework was no archteryx of academic perfection, but it sufficed.”

[1] https://aclweb.org/aclwiki/Distributional_Hypothesis


Context?

Most static word embeddings use a simple notion of context - a word is a “context” for another word when it appears close enough to it in the text.

• But we can also use sentences or entire documents as contexts.

In the most basic case, we fix some number of words as our ‘context window’ and count all pairs of words that are less than that many words away from each other as co-occurrences.


Example time!

Let’s say we have a context window size of 2.

no raw meat pants, please.
please do not send me some raw vegetarians.

What are the co-occurrences of ‘send’?

• ‘do’

• ‘not’

• ‘me’

• ‘some’
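
A minimal sketch of that window-based counting, using the two toy sentences above (whitespace tokenization and the helper name are assumptions made for illustration):

```python
corpus = ["no raw meat pants , please .".split(),
          "please do not send me some raw vegetarians .".split()]

def context_words(sentences, target, window=2):
    """Collect every token appearing within `window` positions of `target`."""
    contexts = []
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == target:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                contexts.extend(sent[lo:i] + sent[i + 1:hi])
    return contexts

print(context_words(corpus, "send"))   # ['do', 'not', 'me', 'some']
```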


The co-occurrence matrix

We can collect the context occurrences in our corpus into a co-occurrence matrix M, where each row corresponds to a word and each column to a context word.

Then the entry M_ij represents how many times word i appeared within the context of word j.


Example II: the example continues

no raw meat pants, please.
please do not send me some raw vegetarians.

What does the co-occurrence matrix look like?
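
One possible answer, as a sketch: build the vocabulary, then count every pair of tokens that fall within the window (whether to include punctuation or lowercase tokens is a preprocessing choice, not something the slides prescribe):

```python
import numpy as np

corpus = ["no raw meat pants , please .".split(),
          "please do not send me some raw vegetarians .".split()]

def cooccurrence_matrix(sentences, window=2):
    """Dense word-by-context count matrix for a tiny corpus."""
    vocab = sorted({tok for sent in sentences for tok in sent})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    M[index[w], index[sent[j]]] += 1
    return M, vocab

M, vocab = cooccurrence_matrix(corpus)
print(vocab)   # row/column labels
print(M)       # the 'send' row has ones in the 'do', 'not', 'me', 'some' columns
```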


Other count-based representations

We can generalize this concept a bit further.

• More general contexts - sentences or documents

• Weighted context windows

• Other measures of word-context association (e.g. pointwise mutual information)

This gives us the more general notion of a word-context matrix.


Embedding with matrix factorization

Goal: find a vector for each word w_i and each context c_j such that the inner product ⟨w_i, c_j⟩ of the two vectors approximates M_ij - that is, the association between w_i and c_j.

Of course, this is easy if we get to use arbitrary-length vectors. But we ideally want low-dimensional representations.

How can we do this? Use the singular value decomposition.


Truncated SVD

Recall that the SVD of a matrix M gives us

$$M = U \Sigma V^\top.$$

To approximate M, we truncate to the top k singular values.


Why truncated SVD?

The truncation

$$M_k = U_k \Sigma_k V_k^\top$$

is the best rank-k approximation to M under the Frobenius norm.

We can view word vectors w_i and context vectors c_j as rows of matrices W and C.

• For W, C with rows of length k, their product WC^T can't approximate M better than M_k.


Embedding with truncated SVD

Of course, truncated SVD has three components, but we only want two matrices, W and C. Traditionally, we set W = U_k Σ_k and C = V_k.


Variations on SVD

Note that this preserves inner products between word vectors (rows of M). But W and C are asymmetric. There is no a priori reason that only C should be orthogonal.

Instead, using symmetric word and context matrices works better in practice.

• Split Σ: W = U_k Σ_k^{1/2}, C = V_k Σ_k^{1/2}

• Omit Σ altogether: W = U_k, C = V_k
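
A sketch of the whole pipeline in numpy; the random matrix stands in for a real count or PMI matrix, and with a realistic vocabulary one would use a sparse truncated solver rather than the dense SVD shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(500, 500)).astype(float)  # stand-in word-context matrix

k = 50
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Symmetric variant: keep the top-k singular values and split Sigma evenly.
W = U[:, :k] * np.sqrt(S[:k])       # word vectors,    |V| x k
C = Vt[:k, :].T * np.sqrt(S[:k])    # context vectors, |C| x k

# W @ C.T equals M_k, the best rank-k approximation to M in Frobenius norm.
```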


In conclusion?

Simple matrix factorization can actually give us decent word representations.


Neural models of semantics

In essence, word2vec approximates distributional semantics with a neural objective.

Two flavors: skipgram and continuous bag-of-words (CBOW). Today, we will focus on skipgram with negative sampling (SGNS).


Context, revisited

When using the matrix-factorization approach, we collect all co-occurrences globally into the same matrix M.

But word2vec takes a slightly different approach - it slides a window over the corpus, essentially looking at one isolated co-occurrence pair at a time.


The setting

We start with the following things:

• A corpus D of co-occurrence pairs (w, c)

• A word vocabulary V

• A context vocabulary C


Skipgram, intuitively

Idea: given a word, predict what context surrounds it.

More specifically, we want to model the probabilities of context words appearing around any specific word.


Skipgram, concretely

Our global learning objective is to maximize the probability of the observed corpus under our model of co-occurrences:

$$\max \prod_{(w,c) \in D} P(w,c)$$

Equivalently, we may maximize the log of this quantity, which gives us the more tractable

$$\max \sum_{(w,c) \in D} \log P(w,c).$$

So how exactly do we calculate P(w, c)?


Skipgram’s modeling assumption

Vanilla skipgram models log P(w, c) as proportional to ⟨w, c⟩, the inner product of the word and context vectors.

What the model actually learns are the vector representations for words and contexts that best fit the distribution of the corpus.


Negative sampling

Since probabilities must sum to 1, we must normalize e^⟨w,c⟩ by some factor. In plain skipgram, we use a softmax over all possible context words:

$$\log P(w,c) = \log \frac{e^{\langle \vec{w}, \vec{c} \rangle}}{\sum_{c' \in C} e^{\langle \vec{w}, \vec{c}\,' \rangle}}.$$

This denominator is computationally intractable. For efficiency's sake, negative sampling (based on noise-contrastive estimation) replaces this with a sampled approximation:

$$\log P(w,c) \approx \log \sigma(\vec{w} \cdot \vec{c}) + \sum_{i=1}^{k} \mathbb{E}_{c'_i \sim P_n(c)}\left[\log \sigma(-\vec{w} \cdot \vec{c}\,'_i)\right].$$


The negative sampling objective

So the SGNS objective for each pair (w, c) is

$$\log P(w,c) = \log \sigma(\vec{w} \cdot \vec{c}) + \sum_{i=1}^{k} \mathbb{E}_{c'_i \sim P_n(c)}\left[\log \sigma(-\vec{w} \cdot \vec{c}\,'_i)\right]$$

$$= \log \sigma(\vec{w} \cdot \vec{c}) + \sum_{i=1}^{k} \mathbb{E}_{c'_i \sim P_n(c)}\left[\log\left(1 - \sigma(\vec{w} \cdot \vec{c}\,'_i)\right)\right].$$
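
As a sanity check, the per-pair objective is easy to write down in numpy. This sketch approximates the expectation with the k sampled negatives themselves and returns the loss (the negated objective) for a single (w, c) pair; the gradient updates to both word and context vectors are not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c, negatives):
    """Negated SGNS objective for one observed (word, context) pair.

    negatives: array of k context vectors sampled from the noise distribution P_n(c).
    """
    pos = np.log(sigmoid(w @ c))
    neg = np.sum(np.log(sigmoid(-negatives @ w)))
    return -(pos + neg)

rng = np.random.default_rng(0)
w, c = rng.normal(size=100), rng.normal(size=100)
negatives = rng.normal(size=(5, 100))   # k = 5 negative samples
print(sgns_loss(w, c, negatives))
```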


Tricks of the trade

In practice, SGNS also uses a couple of other tricks to make things run more smoothly.

• Instead of the empirical unigram distribution, use the unigram distribution raised to the 3/4 power as the noise distribution

• Frequent word subsampling: discard words from the training set with probability

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}},$$

where t is a small threshold and f(w_i) is the corpus frequency of word w_i.
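
Both tricks are small enough to sketch directly. The threshold value below is an assumption; something on the order of 1e-5 is the kind of value used for very large corpora, and with a toy corpus the numbers are not meaningful.

```python
import numpy as np

def noise_distribution(counts):
    """Unigram distribution raised to the 3/4 power, renormalized."""
    p = np.asarray(counts, dtype=float) ** 0.75
    return p / p.sum()

def discard_probability(freq, t=1e-5):
    """Probability of dropping a token whose word has relative frequency `freq`."""
    return np.maximum(0.0, 1.0 - np.sqrt(t / freq))

counts = np.array([5000, 500, 50, 5])
print(noise_distribution(counts))                    # flatter than the raw unigram distribution
print(discard_probability(counts / counts.sum()))    # frequent words are dropped more often
```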


Linear composition, more formally

One of the most exciting results from word2vec is that vectors seem to exhibit some degree of additive compositionality - you can compose words by adding their vectors, and remove concepts by subtracting them.


Additive analogy solving

This lets us solve analogies in the form of A:B :: C:? in a very simple way: take the vector B − A + C, and find the closest word to the result.

Usually when finding the closest word we use cosine distance.


Additive analogy solving

Figure: From https://www.aclweb.org/anthology/N18-2039.


Why?

Why might this work?


Actually...

Additive analogy completion uses some hidden tricks.

• Leave out original query points

• Embeddings are often normalized in preprocessing

$$d = \arg\max_{w \in V} \cos(\vec{w}, \vec{b} - \vec{a} + \vec{c})$$
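
A sketch of the procedure as usually implemented, with both tricks made explicit; here `embeddings` is assumed to be a plain dict mapping words to numpy vectors.

```python
import numpy as np

def solve_analogy(a, b, c, embeddings, exclude_queries=True):
    """Return the word d maximizing cos(d, b - a + c)."""
    # Trick 2: length-normalize all vectors first.
    unit = {w: v / np.linalg.norm(v) for w, v in embeddings.items()}
    target = unit[b] - unit[a] + unit[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for w, v in unit.items():
        # Trick 1: leave the original query points out of the candidates.
        if exclude_queries and w in (a, b, c):
            continue
        sim = float(v @ target)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word
```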


Additive analogy solving (full)

Figure: From https://www.aclweb.org/anthology/N18-2039, “The word analogy testing caveat”, Schluter (2018).


Linear composition (in reality)

Turns out, if we don’t rely on these tricks, performance drops by a lot.

• Leave the query points in? Zero accuracy. (Linzen, 2016)

• Performance is best on words with predictable relations (Finley et al., 2017)

• We can actually do a little better on analogies if we allow general linear transformations instead of just translations (Ethayarajh, 2019), although vector addition is the nice intuitive method


Be careful, but curious

We need to be careful not to get carried away.

But still, the fact that we get any regularity of this sort at all is really exciting!

And word embeddings do give us undeniable performance increases on a wide variety of downstream tasks.


Questions?


Appendix

Further reading

• Distributional semantics and neural embeddings

• Analogy performance


SVD vs word2vec

Neural Word Embedding as Implicit Matrix Factorization. Levy and Goldberg (2014).

Improving Distributional Similarity with Lessons Learned from Word Embeddings. Levy, Goldberg and Dagan (2015).


Analogies

Issues in evaluating semantic spaces using word analogies. Linzen (2016).

What Analogies Reveal about Word Vectors and their Compositionality. Finley, Farmer and Pakhomov (2017).

The word analogy testing caveat. Schluter (2018).


Co-occurrence through a different lens

SVD uses all the co-occurrence information in a corpus, but is sensitive to zero terms and performs worse than word2vec in practice. On the other hand, word2vec fails to use global co-occurrences.


Setting

We’ll consider a co-occurrence matrix M_ij that collects co-occurrence counts between words i and contexts j.

Again, we’d like to construct matrices W and C containing word and context vectors respectively - but this time we also learn individual bias terms for each word and context. Instead of directly factorizing M, this time we have a different objective.


Co-occurrence ratios

GloVe uses the ratios between co-occurrence counts as the target for learning. Intuitively, this captures context differences rather than focusing on similarities.


So what’s our model this time?

We specify some desirable properties that will let us hone in on the appropriate model for P_ik / P_jk.

• First: enforce linear vector differences - the model should depend only on w_i − w_j:

$$F(w_i - w_j, c_k) = \frac{P_{ik}}{P_{jk}}$$

• Second: require that the model be bilinear in w_i − w_j and c_k; choose the dot product for simplicity:

$$G\big((w_i - w_j)^\top c_k\big) = G\big(w_i^\top c_k - w_j^\top c_k\big) = \frac{P_{ik}}{P_{jk}}$$


Constructing the GloVe model

But, wait - in GloVe’s setting, words and contexts are symmetric.

• We’d like the ik term to depend on w_i^T c_k, and likewise with jk in the denominator; we can do this by setting

$$G\big((w_i - w_j)^\top c_k\big) = \frac{G(w_i^\top c_k)}{G(w_j^\top c_k)}.$$

• This gives us G = exp, or

$$w_i^\top c_k = \log P_{ik} = \log(X_{ik}) - \log(X_i).$$

Then we can symmetrize by adding free bias terms for the word and the context to mop up the log(X_i).


GloVe’s objective

All in all, this works out to a log-bilinear model with a weighted least-squares objective,

$$J = \sum_{w,c} f(X_{wc}) \big( \langle \vec{w}, \vec{c} \rangle + b_w + b_c - \log X_{wc} \big)^2,$$

where f is a weighting function that we can choose.
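
A sketch of that objective in numpy. The weighting function below is the common choice from the GloVe paper, f(x) = (x / x_max)^α capped at 1 with x_max = 100 and α = 3/4; it is one reasonable pick for f rather than the only one.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function (x / x_max)^alpha, capped at 1 for large counts."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, C, b_w, b_c, X):
    """Weighted least-squares objective, summed over nonzero co-occurrences."""
    i, j = np.nonzero(X)                        # skip zeros so log(X) is defined
    pred = np.sum(W[i] * C[j], axis=1) + b_w[i] + b_c[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))
```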


How do we know if word embeddings are good?

Two main flavors of evaluation: intrinsic and extrinsic measures.

• Intrinsic evaluation: word similarity or analogy tasks

• Extrinsic evaluation: performance on downstream tasks
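
As an illustration of the intrinsic flavor, a word-similarity evaluation is just a rank correlation between model similarities and human ratings. This sketch assumes `rated_pairs` holds (word1, word2, score) triples such as those in a dataset like WordSim-353.

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_eval(embeddings, rated_pairs):
    """Spearman correlation of cosine similarity against human similarity ratings."""
    model_scores, human_scores = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in embeddings and w2 in embeddings:
            a, b = embeddings[w1], embeddings[w2]
            model_scores.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            human_scores.append(rating)
    return spearmanr(model_scores, human_scores).correlation
```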


Issues with intrinsic evaluation

• Similarity and analogy tasks may be too homogeneous

• Intrinsic and extrinsic measures may not correlate
