lda2vec text by the bay 2016 with notes
TRANSCRIPT
lda2vec (word2vec, and lda)
Christopher Moody @ Stitch Fix
Welcome! Thanks for coming, thanks for having me, and thanks to the organizers.
NLP can be a messy affair because you have to teach a computer about the irregularities and ambiguities of the English language, and about the hierarchical and sparse nature of English grammar and vocabulary.
"3rd trimester, pregnant"
"wears scrubs": medicine
"taking a trip": a fix for vacation clothing
The power and promise of word vectors is to sweep away a lot of these issues.
About
@chrisemoody
Caltech Physics PhD; astrostats; supercomputing; sklearn t-SNE contributor; Data Labs at Stitch Fix
github.com/cemoody
Gaussian Processes, t-SNE, chainer deep learning, Tensor Decomposition
1. word2vec
2. lda
3. lda2vec
1. king - man + woman = queen
2. Huge splash in the NLP world
3. Learns from raw text
4. Pretty simple algorithm
5. Comes pretrained
word2vec
1. Learns what words mean: it can solve analogies cleanly, treating words not as atomic blocks but by modeling the relationships between them.
2. Distributed representations form the basis of more complicated deep learning systems.
3. The algorithm: set up an objective function, randomly initialize the vectors, and do gradient descent.
word2vec
word2vec: learn a word vector w from its surrounding context
1. Notice: no mention of neural networks.
2. Let's talk about training first.
3. Compare n-gram transition probabilities vs. tf-idf / LSI co-occurrence matrices.
4. Here we will learn the embedded representation directly, with no intermediates, updating it with every example.
word2vec
"The fox jumped over the lazy dog"

Maximize the likelihood of seeing the surrounding words given the word "over":

P(the|over) P(fox|over) P(jumped|over) P(the|over) P(lazy|over) P(dog|over)

…instead of maximizing the likelihood of co-occurrence counts.

1. Context: the words surrounding the training word.
2. Naively assume a bag of words: not recurrent, no state.
3. Still a pretty simple assumption!

We condition on just "over": no other secret parameters or anything.
word2vec

P(fox|over)

What should this be?

P(vfox|vover)

It should depend on the word vectors. We're trying to learn the word vectors, so let's start with those (we'll randomly initialize them to begin with).
word2vec

"The fox jumped over the lazy dog"

P(w|c)

Extract pairs from the context window around every input word. Here c is the IN (pivot) word and w is a nearby OUT (target) word; the window slides across the sentence, emitting one (w, c) pair per step.

[Animation: the same slide repeats as the window scans the sentence, pairing each pivot with each of its neighbors in turn.]

In the inner-most for-loop, the pivot c is fixed while we loop over nearby targets; then we increment the pivot to point at the next word. Two for-loops in total. These (w, c) pairs are called skip-grams.

…So that, at a high level, is what we want word2vec to do.
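To make the two for-loops concrete, here is a minimal Python sketch of skip-gram pair extraction (my own illustration, not the original word2vec code; the window size of 2 is an assumption):

# Extract (target, pivot) skip-gram pairs with a +/- 2 word window.
sentence = "the fox jumped over the lazy dog".split()
window = 2

pairs = []
for i, pivot in enumerate(sentence):                # outer loop: the pivot word c
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:                                  # inner loop: nearby targets w
            pairs.append((sentence[j], pivot))      # one (w, c) pair per step

# e.g. the pairs for pivot 'over': ('fox', 'over'), ('jumped', 'over'),
# ('the', 'over'), ('lazy', 'over')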
objective

How should we define P(w|c)? Measure a loss between w and c?

Now we've defined the high-level update path for the algorithm; we need to define this probability exactly in order to define our updates.

It boils down to the difference between the word and context vectors: we want to make them as similar as possible, so that the probability goes up.
objective

w · c

How should we define P(w|c)? Start with the similarity between w and c: use the dot product (cosine similarity, if normalized). You could imagine Euclidean or Mahalanobis distance instead.
word2vec objective

w · c ≈ 1

v_canada · v_snow ≈ 1

The dot product has these properties: similar vectors have similarity near 1 (if normalized).
word2vec objective

w · c ≈ 0

v_canada · v_desert ≈ 0

Orthogonal vectors have similarity near 0.
word2vec objective

w · c ≈ -1

Opposite vectors have similarity near -1.
word2vec objective

w · c ∈ [-1, 1]

But the inner product ranges from -1 to 1 (when normalized)…
word2vec objective

…and we'd like to measure a probability.

σ(c·w) ∈ [0, 1]

Transform again using the sigmoid: similar (w, c) pairs map near 1, dissimilar pairs near 0.
word2vec objective

Loss function:

L = σ(c·w)

A logistic (binary) choice: is this (context, word) combination from our dataset? Are these two similar?

But this is trivially optimized by making all vectors identical, so that everything looks exactly the same.
word2vec objective

The skip-gram negative-sampling (SGNS) model

L = σ(c·w)

The trivial solution is context = word for all vectors: there's no contrast! So we add negative samples.
word2vec objective

The skip-gram negative-sampling model

L = σ(c·w) + σ(-c·w_neg)

Draw random words from the vocabulary as negatives. We discriminate: this (w, c) pair was observed in the corpus, while that one pairs the pivot with a randomly drawn word. Example: (fox, jumped) is observed, but (fox, career) is not. And there's no guessing word popularity as in a full softmax.
word2vec objective

The skip-gram negative-sampling model

L = σ(c·w) + σ(-c·w_neg) + … + σ(-c·w_neg)

Discriminate positive from multiple negative samples. Positive: (fox, jumped); negatives: (fox, career), (fox, android), (fox, sunlight).

A reminder: this L is computed for every pair, and we nudge all of the values of c and w via gradient descent to optimize L.

So that's SGNS, i.e. word2vec. That's it! It's a bit of a disservice to draw this as a giant neural network.
word2vec

The SGNS model and PMI (Levy & Goldberg 2014)

L = σ(c·w) + σ(-c·w_neg)

c_i · w_j = PMI(M_ij) - log k

…is extremely similar to matrix factorization! Here M is the co-occurrence matrix: one row per context word, one column per target word, with entries counting how often the pair occurred. So you can, for example, solve word2vec deterministically via an SVD if you want.

This was one of the most cited NLP papers of 2014.
This connects word2vec to 'traditional' NLP. (Indices are dropped for readability.) It's so heavily cited because of the connection to PMI, an information-theoretic measure of the association between w and c.
word2vec

The SGNS model (Levy & Goldberg 2014), in 'traditional' NLP terms:

L = σ(c·w) + Σ σ(-c·w_neg)

c_i · w_j = log [ (#(c_i, w_j)/n) / (k · (#(w_j)/n) · (#(c_i)/n)) ]

Instead of looping over all the words, you can count the pairs and roll them up this way. The probabilities are just counts divided by the number of observations, n. word2vec adds the -log k bias term. It's cool that all of the for-looping we did before reduces to this closed form.
word2vec

The SGNS model (Levy & Goldberg 2014):

c_i · w_j = log [ popularity of (c_i, w_j) / (k · popularity of c_i · popularity of w_j) ]

The probabilities are just counts divided by the number of observations, and word2vec adds the bias term. This weighting downweights rare terms: more frequent words count for more than infrequent ones. That makes sense given that word frequencies follow Zipf's law, and this weighting is a large part of what makes SGNS so powerful.

The theory is nice, but the practical side is even better:
word2vec and PMI

99% of word2vec is counting, and you can count words in SQL:

Count how many times you saw the pair (c, w).
Count how many times you saw c.
Count how many times you saw w.
word2vec and PMI

…and this takes ~5 minutes to compute on a single core. The SVD is a completely standard math library call.
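A minimal Python sketch of that counting route (toy corpus; the shifted positive-PMI variant and the dimensions are assumptions following Levy & Goldberg):

import numpy as np
from collections import Counter

corpus = [["the", "fox", "jumped", "over", "the", "lazy", "dog"]]
window, k = 2, 5                              # k = number of negative samples

pair_counts = Counter()
for sent in corpus:
    for i, c in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pair_counts[(c, sent[j])] += 1

c_counts, w_counts = Counter(), Counter()     # marginal counts over pairs
for (c, w), cnt in pair_counts.items():
    c_counts[c] += cnt
    w_counts[w] += cnt
n = sum(pair_counts.values())

vocab = sorted(c_counts)
idx = {word: i for i, word in enumerate(vocab)}

# M[i, j] = max(0, PMI(c_i, w_j) - log k)   (shifted positive PMI)
M = np.zeros((len(vocab), len(vocab)))
for (c, w), cnt in pair_counts.items():
    pmi = np.log(cnt * n / (c_counts[c] * w_counts[w]))
    M[idx[c], idx[w]] = max(0.0, pmi - np.log(k))

U, S, Vt = np.linalg.svd(M)                   # word vectors: rows of U * sqrt(S)
vectors = U * np.sqrt(S)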
word2vec

Intuition about word vectors: the plot shows just 2 of the ~500 dimensions (effectively we've PCA'd it) and only 4 of 100k words.

If we only had locality and not regularity, this wouldn't necessarily be true. So we live in a vector space where operations like addition and subtraction are semantically meaningful. Here are a few examples of this working; really get the idea of these vectors as being 'mixes' of other ideas and vectors.
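Here is that analogy arithmetic with off-the-shelf vectors (a hedged example: it assumes gensim is installed, and the pretrained model downloads on first use):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained word vectors
# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))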
ITEM_3469 + 'Pregnant'

(Stitch Fix is a personal styling service; clients receive a box, a 'fix', of items.)

Box + 'Pregnant'

"I love the stripes and the cut around my neckline was amazing."

Someone else might write 'grey and black': there's subtlety and nuance in that language, and for some items we have several times the collected works of Shakespeare in client comments.

= ITEM_701333, ITEM_901004, ITEM_800456

Stripes are safe for maternity; similar tones and flowy cuts are also still great for expecting mothers.
What about LDA?
LDA on client item descriptions

This shows the incredible amount of structure.

LDA on item descriptions (with Jay)

Clunky jewelry in one cluster, dangling delicate jewelry elsewhere.

Topics on patterns and styles: this cluster is similarly described as high-contrast tops with popping colors.

Bright dresses for a warm summer.

LDA helps us model topics over documents in an interpretable way.
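For reference, a minimal gensim LDA sketch on a toy corpus (our item-description corpus is private, so the documents here are made up):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["bright", "summer", "dress"],
        ["delicate", "dangling", "jewelry"],
        ["bright", "tops", "popping", "colors"]]
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                # interpretable word list per topic
print(lda.get_document_topics(bow[0]))   # topic proportions for one document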
lda vs word2vec

LDA is a Bayesian graphical model; word2vec is an ML neural model.

word2vec is local: one word predicts a nearby word.

"I love finding new designer brands for jeans"

…and a window scans across the words, as if the world were one very long text string: no end of documents, no end of sentences.

But text is usually organized.

"I love finding new designer brands for jeans"    doc_7681

In LDA, documents globally predict words. These are client comments, which are short and only predict dozens of words each, but they could be legal or medical documents with 10k words.
typical LDA document vector
[ 0%, 9%, 78%, 11%]

typical word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2]

The LDA entries all sum to 100%; the word2vec entries are all real values.

5D LDA document vector: sparse, sums to 100%, dimensions are absolute.
5D word2vec vector: dense, all real values, dimensions are relative.

It's much easier to tell another human "78% of topic X" than "+2.2 of something and -1.25 of something else". A word2vec vector is like an address, "200 Main St.": you figure out what it means from its neighbors.
LDA is a *mixture* model: a document is 78% of some ingredient, where ingredient = topic. But a word2vec vector isn't "-1.25 of some ingredient".
100D LDA document vector
[ 0%, 0%, 0%, 0%, 0%, … 0%, 9%, 78%, 11%]

100D word2vec vector
[ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2]

LDA: sparse (here, 95 of 100 dimensions are close to zero), sums to 100%, dimensions are absolute. Similar in fewer ways (more interpretable).
word2vec: dense, all real values, dimensions are relative. Similar in 100D ways (very flexible).

What we want to borrow from LDA: +mixture, +sparse.
Can we do both? lda2vec.

(This is a series of experiments; take it with a grain of salt.)
[Diagram: skip-grams from sentences; the word vector for the pivot "German" (e.g. [-1.9, 0.85, -0.6, -0.3, -0.5], #hidden units wide) feeds a negative-sampling loss to predict a nearby word in "Lufthansa is a German airline and when…"]

word2vec predicts locally: one word predicts a nearby word.

We extract pairs of pivot and target words that occur in a moving window that scans across the corpus. For every pair, the pivot word is used to predict the nearby target word.
[Diagram, repeated on the following slides: lda2vec on "Lufthansa is a German airline and when…". A document weight vector (#topics wide, e.g. [0.34, -0.1, 0.17]) is softmax-transformed into a document proportion (41%, 26%, 34%), which mixes the rows of a topic matrix (#topics × #hidden units) into a document vector (e.g. [-0.7, -0.4, -0.7, -0.3, -0.3]). The document vector is added to the pivot's word vector ([-1.9, 0.85, -0.6, -0.3, -0.5]) to give a context vector ([-2.6, 0.45, -1.3, -0.6, -0.8]) that feeds the same skip-gram negative-sampling loss.]
German

The document vector predicts a word from a global context.
We know we can add vectors: "German" + "airline" gives vectors like Lufthansa, Condor Flugdienst, and Aero Lloyd, and just as with "German", a document about French or Spanish airlines should get a document vector similar to the word vector for "airline".

A latent vector is randomly initialized for every document in the corpus. This is very similar to doc2vec and paragraph vectors.
But we're missing mixtures & sparsity! Plain document vectors are good for training sentiment models, and they get great scores, but we lose interpretability.
We're missing mixtures & sparsity! With too many documents, a dense document vector is about as interpretable as a hash.
Now it’s a mixture.
Document X is +0.34 in topic 0, -0.1 in topic 1, and +0.17 in topic 2, in a model with 3 topics. Before, the document vector had ~500 degrees of freedom; now it has just a few, so we'd better choose really good topics, because only a few are available to summarize the entire document.
Topic words: Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication.
Each topic has a distributed representation that lives in the same space as the word vectors. While each topic is not literally a token present in the corpus, it is similar to other tokens.
topic 1 = "religion": Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication.
Topic words: Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic.
Notice we've moved one column over in the topic matrix.
topic 2 = "politics": Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic.
topic vectors, document vectors, and word vectors all live in the same space
The document weights are softmax-transformed to yield the document proportions. It's similar to the logistic function, but instead of a single value between 0 and 1 we get a vector of percentages that sums to 100% and indicates the topic proportions of a single document: one document might be 41% in topic 0, 26% in topic 1, and 34% in topic 2. This is very close to LDA-like representations.
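A tiny numpy illustration of that step, using the weights from the diagram:

import numpy as np

weights = np.array([0.34, -0.1, 0.17])              # document weights
proportions = np.exp(weights) / np.exp(weights).sum()
print(proportions)   # ~[0.40, 0.26, 0.34]: the 41% / 26% / 34% on the slide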
The first time I did this, the proportions were still very dense: with 100 topics, each document had roughly 1% in every topic. A zillion categories, 1% of this, 1% of that… It works mathematically (the addition of lots of little bits, fully distributed), but it isn't interpretable.
Sparsity!
time →
t=0:   34% 32% 34%   (initialized balanced, but dense)
t=10:  41% 26% 34%
t=∞:   99%  1%  0%   (sparse)

A Dirichlet likelihood loss encourages the proportion vectors to become sparser over time. It's relatively simple: we start dense and balanced, and end up sparse.
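A sketch of that Dirichlet term (the concentration alpha < 1 and the strength are hyperparameters I've picked for illustration):

import numpy as np

def dirichlet_loss(proportions, alpha=0.7, strength=1.0):
    # negative Dirichlet log-likelihood (up to a constant); with alpha < 1,
    # probability mass piles up at the simplex corners, rewarding sparsity
    return -strength * np.sum((alpha - 1.0) * np.log(proportions))

dense = np.array([0.34, 0.32, 0.34])
sparse = np.array([0.99, 0.005, 0.005])
print(dirichlet_loss(dense), dirichlet_loss(sparse))   # sparse scores lower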
We end up with something that's quite a bit more complicated, but it achieves our goals: it mixes word vectors with sparse, interpretable document representations.
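Putting the pieces together, a minimal numpy forward pass with the shapes from the diagram (3 topics, 5 hidden units; the topic-matrix values are made up, the rest follow the slide):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

doc_weight = np.array([0.34, -0.1, 0.17])      # free parameters, one row per doc
doc_prop = softmax(doc_weight)                 # ~[41%, 26%, 34%], pushed sparse
topic_matrix = np.array([[-1.4, -0.5, -1.4, -1.9, -1.7],   # topic 0
                         [0.75,  0.96, -0.7, -1.9, -0.2],  # topic 1
                         [-1.1,  0.6,  -0.7, -0.4, -0.7]]) # topic 2

doc_vector = doc_prop @ topic_matrix           # mixture of topic vectors
word_vector = np.array([-1.9, 0.85, -0.6, -0.3, -0.5])     # pivot word "German"
context_vector = doc_vector + word_vector
# context_vector feeds the same skip-gram negative-sampling loss as word2vec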
@chrisemoody
Example: Hacker News comments
Word vectors: https://github.com/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/word_vectors.ipynb

Read through the examples and play around at home.
@chrisemoody
Example: Hacker News comments
Topics: http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/lda2vec.ipynb

topic 16: sci, phys
topic 1: housing
topic 8: finance, bitcoin
topic 23: programming languages
topic 6: transportation
topic 3: education
@chrisemoody
lda2vec.com
http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda.ipynb
+ API docs + Examples + GPU + Tests
@chrisemoody
lda2vec.com

If you want…
human-interpretable doc topics: use LDA.
machine-useable word-level features: use word2vec.
to experiment a lot, with topics over user / doc / region / etc. features (and you have a GPU): use lda2vec.
Credit: large swathes of this talk are from previous presentations by:
• Tomas Mikolov • David Blei • Christopher Olah • Radim Rehurek • Omer Levy & Yoav Goldberg • Richard Socher • Xin Rong • Tim Hopper

Richard Socher and Xin Rong both have lucid explanations of the word2vec gradient.
"PS! Thank you for such an awesome idea" (doc_id=1846)

@chrisemoody

Can we model topics down to sentences? lda2lstm.

Data Labs at Stitch Fix is all about mixing cutting-edge algorithms, but we absolutely need interpretability. Make the initial LSTM vector a Dirichlet mixture and we move from bag-of-words LDA to sentence-level LDA: generate a sentence that's 80% religion, 10% politics. word2vec on the word level, an LSTM on the sentence level, LDA on the document level. Dirichlet-squeezing the internal states and manipulations might also help us understand the science of LSTM dynamics.
@chrisemoody

Can we model topics on images? lda2ae (with TJ Torres).

And now for something completely crazy: fun stuff.
translation (using just a rotation matrix)

Mikolov 2013: English to Spanish via a matrix rotation.

This blows my mind. There's no complicated neural network here; you still have to learn the rotation matrix, but it generalizes very nicely. We have analogies for every linear-algebra operation as a linguistic operator: a robust framework and tools to do science on words.
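A hedged sketch of learning that rotation: given paired English/Spanish vectors from a small dictionary, the orthogonal Procrustes solution gives the best rotation (Mikolov et al. actually fit a general linear map; the random matrices here are stand-ins for real embeddings):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 300))   # English vectors for dictionary words
Y = rng.normal(size=(5000, 300))   # vectors of their Spanish translations

U, _, Vt = np.linalg.svd(X.T @ Y)  # W = argmin ||X W - Y||, W orthogonal
W = U @ Vt

# translate a new English word: nearest Spanish neighbor of its vector @ W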
deepwalk (Perozzi et al. 2014)

Learn word vectors from "sentences":

"The fox jumped over the lazy dog"
vOUT vOUT vOUT vOUT vOUT vOUT

…except here 'words' are graph vertices and 'sentences' are random walks on the graph.
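A minimal deepwalk-flavored sketch (assumes networkx and gensim; the karate-club graph is a stand-in for a real social graph):

import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

def walk(graph, start, length=10):
    path = [start]
    for _ in range(length - 1):
        path.append(random.choice(list(graph.neighbors(path[-1]))))
    return [str(v) for v in path]             # vertices play the role of words

walks = [walk(G, v) for v in G.nodes() for _ in range(20)]   # the 'sentences'
model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=1)
print(model.wv.most_similar("0"))             # vertices nearby on random walks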
Sequence learning: playlists at Spotify

'words' are song indices, 'sentences' are playlists (Erik Bernhardsson). Great performance on 'related artists'.
Sequence learning: fixes at Stitch Fix

Let's try: 'words' are items, 'sentences' are fixes. We learn similarity between styles because they co-occur, and we learn 'coherent' styles.

We got lots of structure! Nearby regions are consistent 'closets'.
What sorts of sequences do you have at Quora? What kinds of things can you learn from context?
context dependent (Levy & Goldberg 2014)

"Australian scientist discovers star with telescope", with a context of ±2 words.

What if we change what counts as context, e.g. by using the dependency parse instead of a linear window?

BoW vs DEPS: bag-of-words contexts give topically-similar neighbors, while dependency contexts give 'functionally' similar neighbors.
Crazy approaches:

Paragraph vectors (just extend the context window)
Context dependency (change the window grammatically)
Social word2vec / deepwalk (a sentence is a walk on the graph)
Spotify (a sentence is a playlist of song_ids)
Stitch Fix (a sentence is a shipment of five items)

(See previous slides.)
CBOW

"The fox jumped over the lazy dog"
vIN vIN vIN → vOUT ← vIN vIN vIN

Guess the word given the context. ~20x faster. (This is the alternative.)

SkipGram

"The fox jumped over the lazy dog"
vOUT vOUT ← vIN → vOUT vOUT vOUT vOUT

Guess the context given the word. Better at syntax. (This is the one we went over.)

CBOW sums the context word vectors, losing word order within the window. Both are good at semantic relationships ("child" and "kid" are nearby; gender in "man" vs. "woman"), but blurring words over the scale of the context window, roughly five words, loses a lot of grammatical nuance. SkipGram preserves order, and so preserves the relationship in pluralizing, for example.
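In gensim the two are one flag apart (toy corpus; gensim 4 argument names assumed):

from gensim.models import Word2Vec

sents = [["the", "fox", "jumped", "over", "the", "lazy", "dog"]]
cbow = Word2Vec(sents, sg=0, vector_size=50, window=5, min_count=1)   # context -> word
sgns = Word2Vec(sents, sg=1, vector_size=50, window=5, min_count=1)   # word -> context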
lda2vec

vDOC = a·vtopic1 + b·vtopic2 + …

Let's make vDOC a sparse mixture: with too many documents, I'd really like to say that document X is 70% in topic 0, 30% in topic 1, …

Let's say that vDOC is a sum of topic vectors: vDOC = topic0 + topic1. This works! 😀 But then vDOC isn't as interpretable as the topic vectors themselves. 😔

softmax(vOUT · (vIN + vDOC))

We want k *sparse* topics. This shows that the many words similar to "vacation" actually come in lots of flavors: wedding words (bachelorette, rehearsals), holiday/event words (birthdays, brunch, christmas, thanksgiving), seasonal words (spring, summer), trip words (getaway), and destinations.
theory of lda2vec

pyLDAvis of lda2vec
LDA results (context: client comment history)

"I loved every choice in this fix!! Great job!"
Tags: Great Stylist, Perfect

There are k tags: Issues, "Cancel/Disappointed", Delivery, "Profile, Pinterest", "Weather/Vacation", "Corrections for Next", "Wardrobe Mix", "Requesting Specific", "Requesting Department", "Requesting Style", "Style, Positive", "Style, Neutral".
LDA results: Body Fit

"My measurements are 36-28-32. If that helps. I like wearing some clothing that is fitted. Very hard for me to find pants that fit right."
LDA results: Sizing

"Really enjoyed the experience and the pieces, sizing for tops was too big. Looking forward to my next box!"
Tags: Excited for next
LDA results: Almost Bought

"It was a great fix. Loved the two items I kept and the three I sent back were close!"
Tags: Perfect
All of the following ideas change what 'words' and 'context' represent, but we'll still use the same word2vec algorithm.
paragraph vector

What about summarizing documents?

"On the day he took office, President Obama reached out to America's enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that…"

"The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama's audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed."
paragraph vector

A normal skip-gram context extends C words before and C words after the IN word, except we stay inside a sentence (IN predicts nearby OUT words).
paragraph vector

A document vector simply extends the context to the whole document: the vector for doc_1347 predicts OUT words anywhere in it.
from gensim.models import Doc2Vec
fn = "item_document_vectors"
model = Doc2Vec.load(fn)
matches = model.most_similar('pregnant')
matches = list(filter(lambda x: 'SENT_' in x[0], matches))

# ['...I am currently 23 weeks pregnant...',
#  "...I'm now 10 weeks pregnant...",
#  '...not showing too much yet...',
#  '...15 weeks now. Baby bump...',
#  '...6 weeks postpartum!...',
#  '...12 weeks postpartum and am nursing...',
#  '...I have my baby shower that...',
#  '...am still breastfeeding...',
#  '...I would love an outfit for a baby shower...']
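For context, a hedged sketch of how such a model gets trained in the first place: each comment becomes a TaggedDocument whose 'SENT_…' tag is what the filter above matches (toy comments; gensim 4 argument names):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

comments = ["i am currently 23 weeks pregnant",
            "i love the stripes and the cut around my neckline"]
docs = [TaggedDocument(c.split(), [f"SENT_{i}"]) for i, c in enumerate(comments)]
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=20)
model.save("item_document_vectors")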
sentence search