Data Science Lab 2017: From bag of texts to bag of clusters (Yevgen Terpil / Paul Khudan)


From bag of texts to bag of clusters

Paul Khudan (pk@youscan.io), Yevgen Terpil (jt@youscan.io)

Map of ML mentions

Mar 2017, collected by YouScan

[Figure: a 2-D map of ML mentions; the highlighted cluster is labeled "conference, meetup" and contains posts like "Join us on May 13 at Data Science Lab…"]

Part 1: Classic approach


Semantic representation of texts

1. Text (semi/un)supervised classification

2. Document retrieval

3. Topic insights

4. Text similarity/relatedness

Requirements

• Vector representation is handy

• Descriptive (not distinctive) features

• Language/style/genre independence

• Robustness to language/speech variance (word- and phrase- level synonymy, word order, newly emerging words and entities)

Prerequisites

• Token-based methods, although char-based methods are more robust

• Preprocessing and unification

• Tokenization

• Lemmatization?

BoW, Tf-idf and more

• Bag of Words: one-hot encoding over the observed dictionary

• TF-IDF: ‘term frequency’ * ‘inverse document frequency’ term weighting (with various normalization schemes)

• Bag of n-grams: collocations carry more specific senses

• Singular Value Decomposition (SVD) of the original term-document matrix (compression that sheds the least relevant information):

◦ resolves inter-document relations: similarity

◦ resolves inter-term relations: synonymy and polysemy

◦ reduces dimensionality
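A minimal sketch of this BoW -> TF-IDF -> SVD pipeline with scikit-learn; the toy documents are placeholders:

# TF-IDF vectors compressed with truncated SVD (a.k.a. LSA)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "Google acquired Kaggle",
    "ODS launched a course on Habr",
    "Join us at the Data Science Lab meetup",
]

tfidf = TfidfVectorizer()              # term frequency * inverse document frequency
X = tfidf.fit_transform(docs)          # sparse term-document matrix
svd = TruncatedSVD(n_components=2)     # keep only the top singular components
X_lsa = svd.fit_transform(X)           # dense low-dimensional document vectors
print(X_lsa.shape)                     # (3, 2)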

BoW, Tf-idf and more

Pros:

- easily interpretable

- easy to implement

- parameters are straightforward

Cons:

- not robust to language variance

- scales badly

- vulnerable to overfitting

[Figure: mention maps built with TF-IDF + SVD + t-SNE and with TF-IDF + SVD; clusters include "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's search queries", "Data Science Lab", "neural network", "artificial intelligence", "deep learning"]

Clustering

1. K-means

2. Hierarchical clustering

3. Density-Based Scan (DBSCAN)

K-means

• Separate all observations into K groups of equal variance

• Iteratively reassign observations to the nearest cluster mean so as to minimize the inertia: the within-cluster sum-of-squares criterion
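A minimal K-means sketch with scikit-learn; the 2-D points are placeholders:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)     # cluster assignment per observation
print(km.inertia_)    # within-cluster sum of squares being minimized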

Hierarchical clustering

• Build a hierarchy of clusters

• Bottom-up or top-down approach (agglomerative or divisive clustering)

• Various metrics for cluster dissimilarity

• Cluster count and contents depend on the chosen dissimilarity threshold

[Figure: dendrogram over points a-f; cutting at the threshold yields clusters a, bc, def]
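A minimal agglomerative (bottom-up) sketch with scikit-learn, where distance_threshold plays the role of the dissimilarity threshold:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1], [10.0, 0.0]])
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0).fit(X)
print(agg.n_clusters_, agg.labels_)   # cluster count depends on the threshold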

Density-Based Scan (DBSCAN)

• Find areas of high density separated by areas of low density of samples

• Involves two parameters: epsilon and minimum points

• Epsilon sets the maximum distance at which two points are considered close enough

• Minimum points is the number of mutually close points required to form a new cluster
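A minimal DBSCAN sketch with scikit-learn; eps and min_samples map to the epsilon and minimum-points parameters above:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [9.0, 9.0]])
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # -1 marks low-density points left outside any cluster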

[Figure: K-Means clusters over TF-IDF + SVD vectors, shown on the mention map]

Word embeddings

Word embeddings that capture semantics: word2vec family, fastText, GloVe

[Figure: word2vec architectures: CBOW and Skip-gram]
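A minimal training sketch with gensim's Word2Vec (4.x API); the toy corpus is a placeholder and real training needs far more text:

from gensim.models import Word2Vec

sentences = [
    ["deep", "learning", "meetup"],
    ["data", "science", "lab"],
    ["machine", "learning", "conference"],
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)  # sg=1: Skip-gram, sg=0: CBOW
vec = model.wv["learning"]   # a 50-dimensional word embedding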

Word embeddings

• Dimension-wise mean/sum/min/max over the embeddings of the words in a text

• Word Mover's Distance (a sketch of both follows)
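A sketch of both aggregation routes with gensim, assuming pretrained vectors are available; the file path is a placeholder:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # placeholder path

def text_vector(tokens, kv):
    # dimension-wise mean over the embeddings of in-vocabulary words
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

# Word Mover's Distance between two token lists
# (gensim built-in; pulls in an extra optimal-transport dependency):
# distance = kv.wmdistance(tokens_a, tokens_b)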

Word embeddings

Pros:

- semantics is included

- moderately robust to language variance

- scales better, including OOV

Cons:

- embeddings source and quality?

- meaning lives in vector relations (distance measures, separating planes), not in the vector values themselves

- meaning degrades quickly on moderate-to-large texts

- interpretation is tedious work

[Figure: Word2Vec mean map; clusters include "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's search queries", "Data Science Lab"; the "purchase, investments" cluster is shown for both Word2Vec mean and TF-IDF + SVD]

Sense clusters

[Figure: the word "картошка" (potato) becomes a 3000-dimensional vector of distances to sense clusters, e.g. (0, 0.9, 0, 0, 0.95, 0, 0.1, …), with nonzero weights on clusters like "food" and "vegetables"]

• Find K cluster centers over target vocabulary embeddings

• Calculate distances (cosine measure) to cluster centers for each vocabulary word, ignore relatively small ones

• Use distances as new K-dimensional feature vector (word embedding)

• Aggregate embeddings

• Normalize?
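A minimal sketch of this procedure, assuming a pretrained gensim KeyedVectors model kv; K = 3000 and the 0.3 cutoff are illustrative guesses, not values from the talk:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def sense_centers(kv, k=3000):
    # K cluster centers over the (normalized) vocabulary embeddings
    vocab = kv.vectors / np.linalg.norm(kv.vectors, axis=1, keepdims=True)
    km = MiniBatchKMeans(n_clusters=k).fit(vocab)
    c = km.cluster_centers_
    return c / np.linalg.norm(c, axis=1, keepdims=True)

def word2sense(word, kv, centers, cutoff=0.3):
    v = kv[word] / np.linalg.norm(kv[word])
    sims = centers @ v            # cosine similarity to every sense cluster
    sims[sims < cutoff] = 0.0     # ignore relatively small ones
    return sims                   # new K-dimensional word embedding

def text2sense(tokens, kv, centers):
    vecs = [word2sense(w, kv, centers) for w in tokens if w in kv]
    return np.mean(vecs, axis=0)  # aggregate; optionally normalize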

Sense clusters

Pros:

- semantics is now valuable (expressed by concrete values in vectors)

- meaning now accumulates better in text vectors

- clusters can be retrofitted with sense interpretations for readability

Cons:

- everything inherited from word embeddings

- chained complexity

- additional parameters to fiddle with

- vector length is higher (around 3k dimensions) -> bigger, heavier, more cumbersome

[Figure: Word2Sense mean map; clusters include "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's search queries", "Data Science Lab", "purchase, investments"]

Doc2Vec

[Figure: Doc2Vec map; clusters include "ODS course on Habr", "Google acquired Kaggle", "Yandex Crypta, women's search queries"]

Part 2: Alternatives

Deep learning

[Figure: K-Means representation map; clusters include "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's search queries", "Data Science Lab"]

Topic modeling

LDA

[Figure: LDA topics on the mention map; examples include "Google acquired Kaggle" and "ODS course on Habr"]
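A minimal LDA sketch with gensim; the corpus and topic count are placeholders:

from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["google", "acquired", "kaggle"],
    ["ods", "course", "habr"],
    ["google", "kaggle", "competition"],
]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]   # bag-of-words corpus
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                      # top words per latent topic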

Sequence-to-Sequence Models

[Figure: encoder-decoder architecture; the encoder's hidden state serves as a document vector. Examples: Neural Machine Translation, Text Summarization]

Skip-Thought

[Figure: a sentence vector is trained with the objective of reconstructing the surrounding sentences]

FastSent

[Figure: word embeddings are summed into a sentence representation; a softmax objective predicts the words of adjacent sentences. A toy sketch follows]
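A toy PyTorch sketch of the FastSent idea (our reconstruction, not the authors' code): sum word embeddings into a sentence vector, then score the vocabulary with a softmax layer:

import torch
import torch.nn as nn

class FastSent(nn.Module):
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)   # word embeddings
        self.out = nn.Linear(dim, vocab_size)      # softmax output layer

    def forward(self, sentence_ids):
        s = self.emb(sentence_ids).sum(dim=0)      # sentence representation
        return self.out(s)                         # logits over the vocabulary

# Training: cross-entropy of these logits against every word id that
# appears in the previous and next sentences of the corpus.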

FastSent

[Figure: FastSent map; clusters include "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's search queries", "Data Science Lab", "purchase, investments", "conference, meetup"]

Sequential Denoising Autoencoder (SDAE)

Corrupt the sentence: delete each word with probability p0 ∈ [0, 1] and swap each non-overlapping bigram with probability px ∈ [0, 1], then train the model to predict the original sentence.

[Figure: the sentence "Google купил сервис для исследователей" ("Google bought a service for researchers") corrupted by word deletion and bigram swapping]
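A sketch of the corruption step on token lists; the encoder-decoder that restores the original sentence is omitted:

import random

def corrupt(tokens, p0=0.1, px=0.1):
    # delete each word with probability p0
    kept = [t for t in tokens if random.random() >= p0]
    # swap each distinct non-overlapping bigram with probability px
    for i in range(0, len(kept) - 1, 2):
        if random.random() < px:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return kept

print(corrupt("Google купил сервис для исследователей".split(), p0=0.2, px=0.2))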

[Figure: SDAE map; clusters include "ODS course on Habr", "Google acquired Kaggle", "Yandex Crypta, women's search queries", "Data Science Lab", "conference, meetup"]

Supervised evaluations

[Table: supervised evaluation results from "Learning Distributed Representations of Sentences from Unlabelled Data"]

Unsupervised (relatedness) evaluations

[Table: unsupervised relatedness evaluation results from the same paper]

Links

Learning Distributed Representations of Sentences from Unlabelled Data: http://www.aclweb.org/anthology/N16-1162

FastSent, SDAE: https://github.com/fh295/SentenceRepresentation

Skip-Thought Vectors: https://github.com/ryankiros/skip-thoughts

Sense clusters: https://servponomarev.livejournal.com/10604, https://habrahabr.ru/post/277563/

Questions?
