![Page 1: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/1.jpg)
From bag of texts to bag of clusters
![Page 2: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/2.jpg)
Paul Khudan, Yevgen Terpil
![Page 3: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/3.jpg)
Map of ML mentions
Mar 2017, collected by YouScan
![Page 4: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/4.jpg)
Map of ML mentions
conference, meetup
![Page 5: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/5.jpg)
Map of ML mentions
We invite you to Data Science Lab on May 13…
conference, meetup
![Page 6: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/6.jpg)
Part 1
Classic approach
Word embeddings
![Page 7: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/7.jpg)
Semantic representation of texts
1. Text (semi/un)supervised classification
2. Document retrieval
3. Topic insights
4. Text similarity/relatedness
![Page 8: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/8.jpg)
Requirements
• Vector representation is handy
• Descriptive (not distinctive) features
• Language/style/genre independence
• Robustness to language/speech variance (word- and phrase-level synonymy, word order, newly emerging words and entities)
![Page 9: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/9.jpg)
Prerequisites
• Token-based methods, although char-based are more robust
• Preprocessing and unification
• Tokenization
• Lemmatization?
![Page 10: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/10.jpg)
BoW, Tf-idf and more
• Bag of Words: one-hot encoding over the observed dictionary
• TF-IDF: ‘term frequency’ * ‘inverse document frequency’ for term weighting (includes different normalization schemes)
• Bag of n-grams: collocations carry more specific senses
• Singular Value Decomposition (SVD) of the original term-document matrix (compression that discards mostly the less relevant information):
◦ resolves inter-document relations: similarity
◦ resolves inter-term relations: synonymy and polysemy
◦ reduces dimensionality
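The pipeline in the list above can be sketched with NumPy alone. This is a minimal illustration, not the speakers' implementation: the toy corpus is invented, the TF-IDF weighting `tf * (log(N / df) + 1)` with L2 row normalization is just one of the possible schemes, and the matrix is laid out document-by-term (the transpose of a term-document matrix).

```python
from collections import Counter

import numpy as np

# Toy corpus (hypothetical), already tokenized and lowercased.
docs = [
    "google bought kaggle".split(),
    "google acquired kaggle platform".split(),
    "deep learning conference in kyiv".split(),
    "machine learning conference and meetup".split(),
]

vocab = sorted({tok for doc in docs for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

# Bag of words: raw term counts, one row per document.
counts = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(docs):
    for tok, n in Counter(doc).items():
        counts[row, index[tok]] = n

# TF-IDF (assumed scheme): tf * (log(N / df) + 1), then L2-normalize each row.
df = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / df) + 1.0
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

# Truncated SVD: keep only the top-k singular directions.
k = 2
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_vectors = U[:, :k] * S[:k]  # dense k-dimensional document vectors
```

In this toy corpus the two Kaggle documents end up close to each other in the reduced space and far from the conference documents, which is the "inter-document similarity" effect the slide refers to.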
![Page 11: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/11.jpg)
BoW, Tf-idf and more
Pros:
- easily interpretable
- easy to implement
- parameters are straightforward
Cons:
- not robust to language variance
- scales badly
- vulnerable to overfitting
![Page 12: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/12.jpg)
TF-IDF + SVD + t-SNE
Labeled regions on the map: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab
![Page 13: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/13.jpg)
TF-IDF + SVD
Labeled regions on the map: neural network; artificial intelligence; deep learning
![Page 14: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/14.jpg)
Clustering
1. K-means
2. Hierarchical clustering
3. Density-based clustering (DBSCAN)
![Page 15: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/15.jpg)
K-means
• Separate all observations into K groups of equal variance
• Iteratively reassign each observation to the nearest cluster mean and recompute the means, minimizing the inertia (within-cluster sum of squares)
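The iteration described above can be sketched as Lloyd's algorithm in NumPy. The function name `kmeans`, the random initialization from data points, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]

    def squared_distances(centers):
        # Pairwise squared Euclidean distances, shape (n_points, k).
        return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

    for _ in range(n_iter):
        labels = squared_distances(centers).argmin(axis=1)
        # Move each center to the mean of its members (keep empty clusters in place).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    dists = squared_distances(centers)
    labels = dists.argmin(axis=1)
    inertia = dists.min(axis=1).sum()  # within-cluster sum of squares
    return labels, centers, inertia

# Two well-separated toy blobs.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers, inertia = kmeans(X, k=2)
```

The result depends on the initialization in general, so real pipelines usually run several restarts and keep the solution with the lowest inertia.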
![Page 16: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/16.jpg)
Hierarchical clustering
• Build a hierarchy of clusters
• Bottom-up or top-down approach (agglomerative or divisive clustering)
• Various metrics for cluster dissimilarity
• Cluster count and contents depend on the chosen dissimilarity threshold
Clusters: a, bc, def
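A bottom-up (agglomerative) pass with a dissimilarity threshold can be sketched in plain Python. The 1-D coordinates below are invented so that the result matches the "a, bc, def" grouping from the slide, and single linkage is only one of the possible cluster dissimilarity metrics.

```python
def agglomerative(points, threshold):
    """Single-linkage agglomerative clustering of named 1-D points."""
    clusters = [[name] for name in points]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(abs(points[p] - points[q]) for p in c1 for q in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical coordinates for the six items.
points = {"a": 0.0, "b": 5.0, "c": 6.0, "d": 12.0, "e": 13.0, "f": 14.5}
result = agglomerative(points, threshold=2.0)
print(", ".join("".join(c) for c in result))  # a, bc, def
```

Raising the threshold merges more clusters (a threshold of 6 here collapses everything into one), which is why the cluster count is a consequence of the threshold rather than an input.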
![Page 17: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/17.jpg)
Density-based clustering (DBSCAN)
• Find areas of high density separated by areas of low density of samples
• Involves two parameters: epsilon and minimum points
• Epsilon sets the maximum distance at which two points are still considered close enough
• Minimum points is the number of mutually close points required to form a new cluster
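The roles of the two parameters can be illustrated with a compact DBSCAN written from scratch in NumPy. It is a simplified sketch with quadratic memory use, not a replacement for a library implementation; the names `dbscan`, `eps` and `min_pts` and the toy points are chosen here for illustration.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 marks noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhoods include the point itself.
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue  # not a core point; stays noise unless a cluster claims it
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels

# Two dense groups and one isolated point.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]]
labels = dbscan(X, eps=1.5, min_pts=3)
```

Unlike K-means, the number of clusters is not specified up front, and the isolated point is reported as noise instead of being forced into a cluster.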
![Page 18: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/18.jpg)
K-Means clusters
TF-IDF + SVD
![Page 19: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/19.jpg)
Word embeddings
Word embeddings that capture semantics: word2vec family, fastText, GloVe
word2vec architectures: CBOW and Skip-gram
![Page 20: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/20.jpg)
Word embeddings
![Page 21: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/21.jpg)
Word embeddings
Dimension-wise mean/sum/min/max over embeddings of words in text
Word Mover’s Distance
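Dimension-wise pooling over word embeddings can be sketched as follows. The 3-dimensional vectors are made up for illustration (real ones would come from word2vec, fastText or GloVe), and skipping out-of-vocabulary tokens is an assumption of this sketch. Word Mover’s Distance is not covered here.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings.
embeddings = {
    "google": np.array([0.9, 0.1, 0.0]),
    "bought": np.array([0.2, 0.8, 0.1]),
    "kaggle": np.array([0.8, 0.2, 0.1]),
}

def text_vector(tokens, embeddings, pool="mean"):
    """Pool word vectors dimension-wise; out-of-vocabulary tokens are skipped."""
    vectors = np.array([embeddings[t] for t in tokens if t in embeddings])
    if len(vectors) == 0:
        raise ValueError("no known tokens in text")
    pools = {"mean": np.mean, "sum": np.sum, "min": np.min, "max": np.max}
    return pools[pool](vectors, axis=0)

tokens = ["google", "bought", "kaggle", "unknown"]
mean_vec = text_vector(tokens, embeddings, pool="mean")
max_vec = text_vector(tokens, embeddings, pool="max")
```

All four pooling modes ignore word order, and the mean in particular flattens out as texts get longer.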
![Page 22: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/22.jpg)
Word embeddings
Pros:
- semantics is included
- moderately robust to language variance
- scales better, including OOV
Cons:
- embeddings source and quality?
- vector relations (distance measures, separating planes) are what really matter, not vector values
- meaning degrades quickly on moderate-to-large texts
- interpretation is tedious work
![Page 23: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/23.jpg)
Word2Vec mean
Labeled regions on the map: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab
![Page 24: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/24.jpg)
Word2Vec mean
purchase, investments
![Page 25: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/25.jpg)
TF-IDF + SVD
purchase, investments
![Page 26: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/26.jpg)
Sense clusters
![Page 27: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/27.jpg)
Sense clusters
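Figure: the word “potato” (картошка) represented as a 3000-dimensional vector of cluster weights, e.g. (0, 0.9, 0, 0, 0.95, 0, 0.1, …); example cluster labels: food, time, vegetables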
• Find K cluster centers over target vocabulary embeddings
• Calculate distances (cosine measure) to cluster centers for each vocabulary word, ignore relatively small ones
• Use distances as new K-dimensional feature vector (word embedding)
• Aggregate embeddings
• Normalize?
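The steps above can be sketched in NumPy. Several things here are assumptions: the cluster centers are hand-picked 2-D vectors (in practice they would come from K-means over the vocabulary embeddings), cosine similarity is used as the "cosine measure", and the cut-off `min_sim` and the mean aggregation are illustrative choices.

```python
import numpy as np

def sense_vectors(word_vecs, centers, min_sim=0.5):
    """Map word embeddings to K-dim vectors of cosine similarities to cluster centers."""
    W = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    C = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = W @ C.T              # shape: (n_words, K)
    sims[sims < min_sim] = 0.0  # ignore relatively small values
    return sims

def text_sense_vector(token_ids, senses, normalize=True):
    """Aggregate (mean) the sense vectors of a text's words, optionally L2-normalize."""
    vec = senses[token_ids].mean(axis=0)
    norm = np.linalg.norm(vec)
    return vec / norm if normalize and norm > 0 else vec

# Hypothetical embeddings for three words: potato, carrot, hour.
word_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])
# Hypothetical cluster centers standing for "food" and "time".
centers = np.array([[1.0, 0.0], [0.0, 1.0]])

senses = sense_vectors(word_vecs, centers)
text_vec = text_sense_vector([0, 1], senses)  # a text containing "potato carrot"
```

Each coordinate of the resulting vector now refers to a nameable cluster, so a text about vegetables has weight on "food" and none on "time".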
![Page 28: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/28.jpg)
Sense clusters
Pros:
- semantics is now valuable (expressed by concrete values in vectors)
- meaning now accumulates in text vectors better
- it is possible to retrofit clusters on sense interpretations for readability
Cons:
- inherited from word embeddings
- chained complexity
- additional parameters to fiddle with
- vector length is higher (around 3k dimensions) -> bigger, cumbersome, heavier
![Page 29: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/29.jpg)
Word2Sense mean
Labeled regions on the map: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab
![Page 30: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/30.jpg)
Word2Sense mean
purchase, investments
![Page 31: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/31.jpg)
Doc2Vec
![Page 32: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/32.jpg)
Doc2Vec
Labeled regions on the map: ODS course on Habr; Google bought Kaggle; Yandex Crypta, women's queries
![Page 33: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/33.jpg)
Part 2
Alternatives
Deep learning
![Page 34: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/34.jpg)
K-Means representation
Labeled regions on the map: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab
![Page 35: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/35.jpg)
Topic modeling
![Page 36: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/36.jpg)
LDA
Labeled regions on the map: Google bought Kaggle; ODS course on Habr
![Page 37: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/37.jpg)
Sequence-to-Sequence Models
Figure label: document vector
Examples: Neural Machine Translation, Text Summarization
![Page 38: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/38.jpg)
Skip-Thought
Figure labels: sentence vector, objective
![Page 39: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/39.jpg)
FastSent
Figure labels: word embedding, sentence representation, softmax, objective
![Page 40: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/40.jpg)
FastSent
Labeled regions on the map: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab
![Page 41: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/41.jpg)
FastSent
purchase, investments
![Page 42: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/42.jpg)
FastSent
conference, meetup
![Page 43: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/43.jpg)
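Sequential Denoising Autoencoder (SDAE)
Corrupt the sentence, then predict the original sentence:
• Delete word: drop each word with probability p0 ∈ [0, 1], e.g. “Google bought a service for researchers” → “Google bought for researchers”
• Swap bigram: swap each non-overlapping bigram with probability px ∈ [0, 1], e.g. “Google bought a service for researchers” → “bought Google a service for researchers”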
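The word-deletion and bigram-swap noise described above can be sketched as a small function. The name `corrupt`, its parameter names, and the choice to apply deletion before swapping are assumptions of this sketch; the autoencoder that learns to restore the original sentence is not shown.

```python
import random

def corrupt(tokens, p_delete, p_swap, rng=random):
    """Noise function: drop words, then swap non-overlapping bigrams."""
    # Delete each word independently with probability p_delete.
    out = [t for t in tokens if rng.random() >= p_delete]
    i = 0
    while i + 1 < len(out):
        # Swap the bigram (out[i], out[i + 1]) with probability p_swap.
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
        i += 2  # bigrams do not overlap
    return out

sentence = "google bought a service for researchers".split()
noisy = corrupt(sentence, p_delete=0.1, p_swap=0.1, rng=random.Random(0))
```

Setting both probabilities to 0 turns the model into a plain sequence autoencoder, since the input is then passed through unchanged.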
![Page 44: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/44.jpg)
ODS курс на хабре
Google купила kaggle
яндекс крипта, запросы женщин
Data Science Lab
SDAE
![Page 45: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/45.jpg)
conference, meetup
SDAE
![Page 46: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/46.jpg)
Supervised evaluations
Learning Distributed Representations of Sentences from Unlabelled Data
![Page 47: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/47.jpg)
Unsupervised (relatedness) evaluations
Learning Distributed Representations of Sentences from Unlabelled Data
![Page 48: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/48.jpg)
Links
• Learning Distributed Representations of Sentences from Unlabelled Data: http://www.aclweb.org/anthology/N16-1162
• FastSent, SDAE: https://github.com/fh295/SentenceRepresentation
• Skip-Thought Vectors: https://github.com/ryankiros/skip-thoughts
• Sense clusters: https://servponomarev.livejournal.com/10604, https://habrahabr.ru/post/277563/
![Page 49: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)](https://reader034.vdocuments.mx/reader034/viewer/2022051318/5a6555477f8b9a5b558b6cbf/html5/thumbnails/49.jpg)
Questions?