word2vec: from intuition to practice using gensim
TRANSCRIPT
Edgar [email protected]
Python Peru Meetup
September 1st, 2016
Lima, Perú
About Edgar Marca
# Software Engineer at Love Mondays.
# One of the organizers of the Data Science Lima Meetup.
# Machine Learning and Data Science enthusiast.
# I speak a little Portuguese.
DATA SCIENCE LIMA MEETUP
Data Science Lima Meetup
Facts
# 5 meetups so far, with the 6th around the corner.
# 410 Datanautas in the Meetup group.
# 329 people in the Facebook group.
Organizers
# Manuel Solorzano.
# Dennis Barreda.
# Freddy Cahuas.
# Edgar Marca.
Data Science Lima Meetup
Figure: Photo from the fifth Data Science Lima Meetup.
DATA
Data Never Sleeps
Figure: How much data is generated every minute? [1]
[1] Data Never Sleeps 3.0: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
NATURAL LANGUAGE PROCESSING
Introduction
# Text is the core business of internet companies today.
# Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking, and many other tasks (spam detection, ads recommendations, email categorization, machine translation, speech recognition, etc.).
Natural Language Processing
Problems with text
# Messy.
# Irregularities of the language.
# Hierarchical structure.
# Sparse nature.
REPRESENTATIONS FOR TEXTS
Contextual Representation
How to Learn good representations?
One-hot Representation
One-hot encoding
Represent every word as a vector in R^|V| with all 0s and a single 1 at the index of that word.
One-hot Representation
Example:
Let V = {the, hotel, nice, motel}

w_the = (1, 0, 0, 0)^T, w_hotel = (0, 1, 0, 0)^T, w_nice = (0, 0, 1, 0)^T, w_motel = (0, 0, 0, 1)^T

We represent each word as a completely independent entity. This word representation does not directly give us any notion of similarity.
One-hot Representation
For instance
⟨w_hotel, w_motel⟩_{R^4} = 0 (1)
⟨w_hotel, w_cat⟩_{R^4} = 0 (2)

We can try to reduce the size of this space from R^4 to something smaller and find a subspace that encodes the relationships between words.
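The orthogonality above can be checked with a minimal numpy sketch (the four-word vocabulary is taken from the example; the code itself is illustrative, not from the talk):

```python
import numpy as np

# Four-word vocabulary from the example; each word becomes a basis vector of R^4.
vocab = ["the", "hotel", "nice", "motel"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["hotel"])                      # [0. 1. 0. 0.]
# Distinct one-hot vectors are orthogonal, so their inner product is 0:
# the representation encodes no similarity between hotel and motel.
print(one_hot["hotel"] @ one_hot["motel"])   # 0.0
```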
One-hot Representation
Problems
# The dimension depends on the vocabulary size.
# Leads to data sparsity, so we need more data.
# Provides no useful information to the system.
# Encodings are arbitrary.
Bag-of-words representation
# Sum of one-hot codes.
# Ignores the order of words.
Examples:
# vocabulary = (monday, tuesday, is, a, today)
# Monday Monday = [2, 0, 0, 0, 0]
# today is monday = [1, 0, 1, 0, 1]
# today is tuesday = [0, 1, 1, 0, 1]
# is a monday today = [1, 0, 1, 1, 1]
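The examples above can be reproduced in plain Python (the vocabulary is the one from the slide; the helper function is an illustrative sketch):

```python
from collections import Counter

# Vocabulary from the slide; its order fixes each word's index.
vocab = ["monday", "tuesday", "is", "a", "today"]

def bag_of_words(text):
    """Sum of one-hot codes: count how often each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

print(bag_of_words("Monday Monday"))      # [2, 0, 0, 0, 0]
print(bag_of_words("today is monday"))    # [1, 0, 1, 0, 1]
# Word order is ignored: any permutation yields the same vector.
print(bag_of_words("is a monday today") == bag_of_words("a today is monday"))  # True
```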
Distributional hypothesis
You shall know a word by the company it keeps!
Firth (1957)
Language Modeling (Unigrams, Bigrams, etc)
A language model is a probabilistic model that assigns a probability to any sequence of n words, P(w_1, w_2, ..., w_n).
Unigrams
Assuming that the word occurrences are completely independent:

P(w_1, w_2, ..., w_n) = Π_{i=1}^{n} P(w_i) (3)
Language Modeling (Unigrams, Bigrams, etc)
Bigrams
The probability of the sequence depends on the pairwise probability of each word in the sequence given the word before it.

P(w_1, w_2, ..., w_n) = Π_{i=2}^{n} P(w_i | w_{i−1}) (4)
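Both models can be sketched by estimating the probabilities from counts over a toy corpus (the corpus and the maximum-likelihood estimates are illustrative choices, not from the talk):

```python
from collections import Counter

# Toy corpus (illustrative); in practice counts come from a large corpus.
corpus = "the hotel is nice the motel is nice".split()

# Unigram MLE: P(w) = count(w) / N
unigram = Counter(corpus)
N = len(corpus)

def p_unigram(w):
    return unigram[w] / N

# Bigram MLE: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, w):
    return bigram[(prev, w)] / unigram[prev]

# Sequence probability under each model (Eqs. 3 and 4)
def p_seq_unigram(words):
    p = 1.0
    for w in words:
        p *= p_unigram(w)
    return p

def p_seq_bigram(words):
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(prev, w)
    return p

print(p_unigram("the"))                      # 0.25
print(p_seq_bigram(["the", "hotel", "is"]))  # 0.5
```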
Word Embeddings
Word Embeddings
A set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size (a "continuous space").
# Vector space models (VSMs) represent (embed) words in a continuous vector space.
# Semantically similar words are mapped to nearby points.
# The basic idea is the Distributional Hypothesis: words that appear in the same context share semantic meaning.
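A small illustration of "nearby points": cosine similarity over hand-made toy vectors (the 3-dimensional embeddings are invented for illustration; real models learn hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-d embeddings; values chosen so related words point the same way.
hotel = np.array([0.9, 0.1, 0.2])
motel = np.array([0.8, 0.2, 0.1])
cat   = np.array([0.1, 0.9, 0.7])

print(cosine(hotel, motel) > cosine(hotel, cat))   # True: similar words lie closer
```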
WORD2VEC
Distributional hypothesis
You shall know a word by the company it keeps!
Firth (1957)
Word2Vec
Figure: Two original papers published in association with word2vecby Mikolov et al. (2013)
# Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/abs/1301.3781
# Distributed Representations of Words and Phrases and their Compositionality: https://arxiv.org/abs/1310.4546
Continuous Bag of Words and Skip-gram
Contextual Representation
Word is represented by context in use
Word Vectors
Word2Vec
# v_king − v_man + v_woman ≈ v_queen
# v_paris − v_france + v_italy ≈ v_rome
# Learns from raw text.
# Huge splash in the NLP world.
# Comes pretrained (if you don't have any specialized vocabulary).
# Word2vec is a computationally efficient model for learning word embeddings.
# Word2Vec is a successful example of "shallow" learning.
# A very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
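The analogy arithmetic can be sketched with hand-built toy vectors (the two axes and all values below are invented for illustration; real embeddings are learned from text):

```python
import numpy as np

# Hypothetical 2-d embeddings where one axis encodes "royalty" and the
# other "male"; real word2vec vectors are learned, not hand-built.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}

# v_king - v_man + v_woman lands exactly on v_queen in this toy space.
target = vecs["king"] - vecs["man"] + vecs["woman"]
closest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - target))
print(closest)   # queen
```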
Word2vec
Gensim
APPLICATIONS
What the Fuck Are Trump Supporters Thinking?
# They gathered four million tweets belonging to more than two thousand hard-core Trump supporters.
# Distances between those vectors encoded the semantic distance between their associated words (e.g., the vector representation of the word morons was near idiots but far away from funny).
Link: https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d
Restaurant Recommendation
http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opentable
Song Recommendations
Link: https://social.shorthand.com/mawsonguy/3CfQA8mj2S/playlist-harvesting
TAKEAWAYS
Takeaways
# If you don't have enough data you can use pre-trained models.
# Remember: garbage in, garbage out.
# Every dataset will produce different results.
# Use Word2vec as a feature extractor.
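One common way to use word2vec as a feature extractor is to average the vectors of a document's words; a sketch with a hypothetical word-to-vector mapping (in practice `wv` would come from a trained or pre-trained model):

```python
import numpy as np

# Hypothetical word -> vector mapping, standing in for a real model's vectors.
wv = {"nice": np.array([1.0, 0.0]), "hotel": np.array([0.0, 1.0])}

def doc_features(tokens, word_vectors, dim):
    """Average the vectors of the in-vocabulary tokens into one feature vector."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(doc_features(["nice", "hotel"], wv, 2))   # [0.5 0.5]
```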
Obrigado (Thank you!)