word2vec: from intuition to practice using gensim
TRANSCRIPT
Edgar [email protected]
Python Peru Meetup
September 1st, 2016
Lima, Perú
About Edgar Marca
# Software Engineer at Love Mondays.
# One of the organizers of the Data Science Lima Meetup.
# Machine Learning and Data Science enthusiast.
# I speak a little Portuguese.
DATA SCIENCE LIMA MEETUP
Data Science Lima Meetup
Facts
# 5 meetups so far, with the 6th around the corner.
# 410 Datanautas in the Meetup group.
# 329 people in the Facebook group.
Organizers
# Manuel Solorzano.
# Dennis Barreda.
# Freddy Cahuas.
# Edgar Marca.
Data Science Lima Meetup
Figure: Photo from the fifth Data Science Lima Meetup.
DATA
Data Never Sleeps
Figure: How much data is generated every minute? [1]
[1] Data Never Sleeps 3.0: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
NATURAL LANGUAGE PROCESSING
Introduction
# Text is the core business of internet companies today.
# Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking, and many other tasks (spam detection, ads recommendations, email categorization, machine translation, speech recognition, etc.).
Natural Language Processing
Problems with text
# Messy.
# Irregularities of the language.
# Hierarchical structure.
# Sparse nature.
REPRESENTATIONS FOR TEXTS
Contextual Representation
How to Learn good representations?
One-hot Representation
One-hot encoding
Represent every word as a vector in R^|V| with all 0s and a single 1 at the index of that word.
One-hot Representation
Example:
Let V = {the, hotel, nice, motel}

w_the = (1, 0, 0, 0)^T, w_hotel = (0, 1, 0, 0)^T, w_nice = (0, 0, 1, 0)^T, w_motel = (0, 0, 0, 1)^T

We represent each word as a completely independent entity. This word representation does not directly give us any notion of similarity.
One-hot Representation
For instance
⟨w_hotel, w_motel⟩_{R^4} = 0 (1)
⟨w_hotel, w_cat⟩_{R^4} = 0 (2)

We can try to reduce the size of this space from R^4 to something smaller and find a subspace that encodes the relationships between words.
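The orthogonality above can be checked with a minimal numpy sketch (the four-word vocabulary is taken from the example; the code itself is illustrative, not from the talk):

```python
import numpy as np

# Four-word vocabulary from the example; each word becomes a basis vector of R^4.
vocab = ["the", "hotel", "nice", "motel"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["hotel"])                      # [0. 1. 0. 0.]
# Distinct one-hot vectors are orthogonal, so their inner product is 0:
# the representation encodes no similarity between hotel and motel.
print(one_hot["hotel"] @ one_hot["motel"])   # 0.0
```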
One-hot Representation
Problems
# The dimension depends on the vocabulary size.
# Leads to data sparsity, so we need more data.
# Provides no useful information to the system.
# Encodings are arbitrary.
Bag-of-words representation
# Sum of one-hot codes.
# Ignores the order of words.
Examples:
# vocabulary = (monday, tuesday, is, a, today)
# Monday Monday = [2, 0, 0, 0, 0]
# today is monday = [1, 0, 1, 0, 1]
# today is tuesday = [0, 1, 1, 0, 1]
# is a monday today = [1, 0, 1, 1, 1]
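The examples above can be reproduced in plain Python (the vocabulary is the one from the slide; the helper function is an illustrative sketch):

```python
from collections import Counter

# Vocabulary from the slide; its order fixes each word's index.
vocab = ["monday", "tuesday", "is", "a", "today"]

def bag_of_words(text):
    """Sum of one-hot codes: count how often each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

print(bag_of_words("Monday Monday"))      # [2, 0, 0, 0, 0]
print(bag_of_words("today is monday"))    # [1, 0, 1, 0, 1]
# Word order is ignored: any permutation yields the same vector.
print(bag_of_words("is a monday today") == bag_of_words("a today is monday"))  # True
```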
Distributional hypothesis
You shall know a word by the company it keeps!
Firth (1957)
Language Modeling (Unigrams, Bigrams, etc)
A language model is a probabilistic model that assigns a probability to any sequence of n words, P(w_1, w_2, ..., w_n).
Unigrams
Assuming that the word occurrences are completely independent:

P(w_1, w_2, ..., w_n) = Π_{i=1}^{n} P(w_i) (3)
Language Modeling (Unigrams, Bigrams, etc)
Bigrams
The probability of the sequence depends on the pairwise probability of each word in the sequence given the word before it.

P(w_1, w_2, ..., w_n) = Π_{i=2}^{n} P(w_i | w_{i−1}) (4)
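Both models can be sketched by estimating the probabilities from counts over a toy corpus (the corpus and the maximum-likelihood estimates are illustrative choices, not from the talk):

```python
from collections import Counter

# Toy corpus (illustrative); in practice counts come from a large corpus.
corpus = "the hotel is nice the motel is nice".split()

# Unigram MLE: P(w) = count(w) / N
unigram = Counter(corpus)
N = len(corpus)

def p_unigram(w):
    return unigram[w] / N

# Bigram MLE: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, w):
    return bigram[(prev, w)] / unigram[prev]

# Sequence probability under each model (Eqs. 3 and 4)
def p_seq_unigram(words):
    p = 1.0
    for w in words:
        p *= p_unigram(w)
    return p

def p_seq_bigram(words):
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(prev, w)
    return p

print(p_unigram("the"))                      # 0.25
print(p_seq_bigram(["the", "hotel", "is"]))  # 0.5
```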
Word Embeddings
Word Embeddings
A set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size (a "continuous space").
# Vector space models (VSMs) represent (embed) words in a continuous vector space.
# Semantically similar words are mapped to nearby points.
# The basic idea is the Distributional Hypothesis: words that appear in the same context share semantic meaning.
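A small illustration of "nearby points": cosine similarity over hand-made toy vectors (the 3-dimensional embeddings are invented for illustration; real models learn hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-d embeddings; values chosen so related words point the same way.
hotel = np.array([0.9, 0.1, 0.2])
motel = np.array([0.8, 0.2, 0.1])
cat   = np.array([0.1, 0.9, 0.7])

print(cosine(hotel, motel) > cosine(hotel, cat))   # True: similar words lie closer
```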
WORD2VEC
Distributional hypothesis
You shall know a word by the company it keeps!
Firth (1957)
Word2Vec
Figure: Two original papers published in association with word2vecby Mikolov et al. (2013)
# Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/abs/1301.3781
# Distributed Representations of Words and Phrases and their Compositionality: https://arxiv.org/abs/1310.4546
Continuous Bag of Words and Skip-gram
Contextual Representation
Word is represented by context in use
Word Vectors
Word2Vec
# v_king − v_man + v_woman ≈ v_queen
# v_paris − v_france + v_italy ≈ v_rome
# Learns from raw text.
# Huge splash in the NLP world.
# Comes pretrained (if you don't have any specialized vocabulary).
# Word2vec is a computationally efficient model for learning word embeddings.
# Word2Vec is a successful example of "shallow" learning.
# A very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
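The analogy arithmetic can be sketched with hand-built toy vectors (the two axes and all values below are invented for illustration; real embeddings are learned from text):

```python
import numpy as np

# Hypothetical 2-d embeddings where one axis encodes "royalty" and the
# other "male"; real word2vec vectors are learned, not hand-built.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}

# v_king - v_man + v_woman lands exactly on v_queen in this toy space.
target = vecs["king"] - vecs["man"] + vecs["woman"]
closest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - target))
print(closest)   # queen
```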
Word2vec
Gensim
APPLICATIONS
What the Fuck Are Trump Supporters Thinking?
# They gathered four million tweets belonging to more than two thousand hard-core Trump supporters.
# Distances between those vectors encoded the semantic distance between their associated words (e.g., the vector representation of the word morons was near idiots but far away from funny).
Link: https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d
Restaurant Recommendation
http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opentable
Song Recommendations
Link: https://social.shorthand.com/mawsonguy/3CfQA8mj2S/playlist-harvesting
TAKEAWAYS
Takeaways
# If you don't have enough data you can use pre-trained models.
# Remember: garbage in, garbage out.
# Every dataset will produce different results.
# Use Word2vec as a feature extractor.
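One common way to use word2vec as a feature extractor is to average the vectors of a document's words; a sketch with a hypothetical word-to-vector mapping (in practice `wv` would come from a trained or pre-trained model):

```python
import numpy as np

# Hypothetical word -> vector mapping, standing in for a real model's vectors.
wv = {"nice": np.array([1.0, 0.0]), "hotel": np.array([0.0, 1.0])}

def doc_features(tokens, word_vectors, dim):
    """Average the vectors of the in-vocabulary tokens into one feature vector."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(doc_features(["nice", "hotel"], wv, 2))   # [0.5 0.5]
```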
Obrigado (Thank you!)