Sentiment Analysis: best practices and challenges Vitalii Radchenko


Post on 10-Aug-2020

Sentiment Analysis: best practices and challenges

Vitalii Radchenko


Problem definition

• A company wants to build a sentiment analysis model

• The main task is to classify each review as positive or negative

• Metrics: accuracy / F1 score
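As a rough illustration of the two metrics named above, the sketch below computes accuracy and F1 by hand for binary labels. The function names and the toy labels are made up for this example; in practice a library implementation would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Share of reviews whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = positive review, 0 = negative review (hypothetical data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

On this toy data accuracy is noticeably higher than F1, which is why it can help to look at both when the classes are imbalanced.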

Data sources

• Open datasets:
    • Amazon – 143.7 million reviews
    • IMDb (50k), RT, Twitter (1.5M)

• Parse data:
    • iHerb, iTunes, RT, GoodReads, Expedia, Yelp, etc.
    • Remember about Terms of Use

https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/data
https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/src/parsers


Data Analysis

• Very important: don’t skip this step

• Calculate simple statistics:
    • Count reviews
    • Mean number of words per review, mean length (chars)
    • Look at the distribution of the number of words (≤3, 4-10, 11-50, ≥51)
    • Count duplicates
    • Check languages (with spaCy)

https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/data
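The simple statistics listed above can be sketched with the standard library alone. The function name, the bucket labels, and the toy corpus are assumptions made for this example; the language check is left out because it needs an external library.

```python
from collections import Counter
from statistics import mean


def describe_reviews(reviews):
    """Return simple corpus statistics for a list of review strings."""
    word_counts = [len(r.split()) for r in reviews]
    buckets = Counter()
    for n in word_counts:
        if n <= 3:
            buckets["<=3"] += 1
        elif n <= 10:
            buckets["4-10"] += 1
        elif n <= 50:
            buckets["11-50"] += 1
        else:
            buckets[">=51"] += 1
    return {
        "n_reviews": len(reviews),
        "mean_words": mean(word_counts),
        "mean_chars": mean(len(r) for r in reviews),
        "length_buckets": dict(buckets),
        # Every repeat of an already-seen review counts as one duplicate.
        "n_duplicates": len(reviews) - len(set(reviews)),
    }


# Hypothetical toy corpus.
reviews = [
    "Great movie",
    "Great movie",
    "I did not like the plot at all",
    "Bad",
]
stats = describe_reviews(reviews)
```

Exact string comparison is the crudest possible duplicate check; normalizing case and whitespace first would catch more near-duplicates.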


Data preprocessing

• Text preprocessing

• Text -> Vector

• Embeddings

• Dimensionality reduction

Lemmatization examples:

Children -> child

Better -> good


Text preprocessing

• NLTK – over 50 corpora, WordNet, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries

• TextBlob – part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, etc.

• Pattern – fast part-of-speech tagger for English, sentiment analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface

• spaCy – tokenization, syntax-driven sentence segmentation, pre-trained word vectors, part-of-speech tagging, named entity recognition, labelled dependency parsing (written in Cython)

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-NLP_Libraries.ipynb


Text -> Vector

• Bag of Words
    • CountVectorizer (ngrams, max/min_df, max_features)
    • Tf-Idf (ngrams, max/min_df, max_features, norm, smooth_idf)
    • HashingVectorizer (ngrams, n_features, non_negative)

• Sentiment features
    • polarity, subjectivity (TextBlob)
    • contrast conjunctions
    • positive and negative smileys

• Manual features
    • count exclamation and question marks
    • uppercase words
    • extract rating from text (“2/10”)

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb
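A minimal sketch of the hand-crafted features from the list above, using only the standard library. The smiley and conjunction lexicons, the rating pattern (10-point scale only), and the function name are assumptions for illustration; the source does not say which lists were used.

```python
import re

# Hypothetical, deliberately tiny lexicons; real lists would be much longer.
POSITIVE_SMILEYS = (":)", ":-)", ":D", ";)")
NEGATIVE_SMILEYS = (":(", ":-(", ":'(")
CONTRAST_CONJUNCTIONS = {"but", "however", "although", "though", "yet"}

RATING_RE = re.compile(r"\b(\d{1,2})\s*/\s*10\b")


def manual_features(text):
    """Extract a few hand-crafted features from one review."""
    tokens = re.findall(r"[A-Za-z']+", text)
    rating = RATING_RE.search(text)
    return {
        "n_exclamation": text.count("!"),
        "n_question": text.count("?"),
        # Words of 2+ letters written fully in capitals, e.g. "WORST".
        "n_uppercase_words": sum(len(t) > 1 and t.isupper() for t in tokens),
        "n_contrast": sum(t.lower() in CONTRAST_CONJUNCTIONS for t in tokens),
        "n_pos_smileys": sum(text.count(s) for s in POSITIVE_SMILEYS),
        "n_neg_smileys": sum(text.count(s) for s in NEGATIVE_SMILEYS),
        # Rating on a 10-point scale, or None when the text has none.
        "rating": int(rating.group(1)) if rating else None,
    }


features = manual_features("The cast is great, but the plot is the WORST!!! 2/10 :(")
```

Features like these are typically stacked next to the bag-of-words columns before training a model.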


Embeddings

• Word2Vec
    • pre-trained: GoogleNews 6Bx300
    • gensim – fastest (also available in TensorFlow)

• GloVe
    • pre-trained: Stanford
    • C (Stanford) / TensorFlow / NumPy implementations

• Hellinger PCA

https://github.com/3Top/word2vec-api
https://github.com/udsclub/workshop/blob/master/notebooks/USDC-workshop-word2vec_practice_gensim.ipynb


Dimensionality reduction

• PCA & SVD don’t work with sparse matrices

• TruncatedSVD works directly on sparse input
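A small sketch of the point above, assuming scikit-learn is installed: a Tf-Idf matrix is sparse, and TruncatedSVD can reduce it without densifying it first. The toy corpus and the choice of two components are arbitrary.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
reviews = [
    "great movie with a great cast",
    "terrible plot and terrible acting",
    "great acting but a boring plot",
    "boring movie with a terrible cast",
]

# TfidfVectorizer returns a SciPy sparse matrix.
X_sparse = TfidfVectorizer().fit_transform(reviews)

# TruncatedSVD accepts the sparse matrix directly and returns a dense array.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)
```

On real data the number of components is usually in the hundreds and is worth tuning like any other hyperparameter.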


Approaches

• Linear Models:
    • SVM, Logistic Regression, Naive Bayes

• Trees, ensembles and boosting
    • Random Forest, ExtraTrees, XGBoost, LightGBM

• FastText

• Word-based NN
    • LSTM, GRU, CNN

• Char-based NN
    • CNN & Dense, CNN & LSTM


Linear Models

• LinearSVC with small data, Logistic Regression with bigger data

• Count/Tf-Idf Vectorizer with many n-grams and regularization (min/max_df, max_features)

• Stemming/lemmatization doesn’t help

• Remove stopwords, validating the choice with cross-validation

https://github.com/udsclub/workshop/blob/master/notebooks/USDC-workshop-Linear_models__svm_logistic_regression.ipynb
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-working-with-linear-models.ipynb
https://github.com/udsclub/xray-sentiment-analysis
https://github.com/udsclub/zulu-sentiment-analysis
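The linear-model recipe above might look roughly like this with scikit-learn (assumed to be installed). The toy data and every hyperparameter value are placeholders, not the settings used in the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
texts = [
    "great movie, loved it", "really great acting", "loved the cast",
    "a great and touching story", "what a great film", "loved every minute",
    "terrible movie, hated it", "really terrible acting", "hated the cast",
    "a terrible and boring story", "what a terrible film", "hated every minute",
]
labels = [1] * 6 + [0] * 6

model = make_pipeline(
    # Unigrams to trigrams; min_df / max_df / max_features limit the vocabulary.
    TfidfVectorizer(ngram_range=(1, 3), min_df=1, max_df=0.9, max_features=50000),
    LogisticRegression(C=1.0, max_iter=1000),
)

# Cross-validation is also where choices such as stop_words can be compared.
scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
model.fit(texts, labels)
prediction = model.predict(["great film"])[0]
```

Swapping `LogisticRegression` for `sklearn.svm.LinearSVC` gives the small-data variant mentioned above; the rest of the pipeline stays the same.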


Trees, ensembles and boosting

• The worst models for sentiment analysis :)

• Tend to overfit

• Work well in an ensemble with linear models

https://github.com/udsclub/kilo-sentiment-analysis
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-trees-and-boosting.ipynb


FastText

• Very simple

• Needs text preprocessing (spaCy English model + stopwords)

• Pre-trained vectors: wiki.en

• Tune regularization parameters

• Good results

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-fastText.ipynb
https://github.com/udsclub/foxtrot-sentiment-analysis
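For supervised training, fastText reads one example per line, with the label marked by the `__label__` prefix. The sketch below only prepares such lines; the regex cleanup is a simplified stand-in for the spaCy and stopword preprocessing mentioned above, and the helper name and toy reviews are invented for this example.

```python
import re


def to_fasttext_line(text, label):
    """Format one review as a fastText supervised-training line."""
    # Lowercase, separate punctuation from words and collapse whitespace.
    cleaned = re.sub(r"([.!?,'/()])", r" \1 ", text.lower())
    cleaned = " ".join(cleaned.split())
    return f"__label__{label} {cleaned}"


# Hypothetical toy reviews.
reviews = [("Great movie, loved it!", "positive"), ("Boring. Never again", "negative")]
lines = [to_fasttext_line(text, label) for text, label in reviews]
```

The resulting lines would be written to a plain text file and passed to fastText as the training input.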



Word-based NN (LSTM)

• The best: a simple LSTM

• Pre-trained Google word2vec as embeddings

• Truncate from the end ("post"); a bigger maxlen is better

• Use masking, the Adam optimizer, and many dropouts!

• Have to store a big vocabulary and embeddings

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-rnn.ipynb
https://github.com/udsclub/alpha-sentiment-analysis/tree/master/full_movie_reviews
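The truncation and masking advice above can be illustrated without a deep learning framework. This hypothetical helper presumably mirrors what Keras `pad_sequences` does with `truncating='post'` and its default left padding; the vocabulary and `maxlen` are toy values.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Cut a sequence at the end ("post") or left-pad it to exactly maxlen."""
    if len(token_ids) >= maxlen:
        # "Post" truncation keeps the first maxlen tokens and drops the tail.
        return token_ids[:maxlen]
    # Left-padding with pad_id; id 0 is reserved so a masking layer can skip it.
    return [pad_id] * (maxlen - len(token_ids)) + token_ids


# Hypothetical vocabulary; index 0 is reserved for padding.
vocab = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}
short = [vocab[w] for w in "the movie was great".split()]
long = [vocab[w] for w in "the movie was boring the movie was boring".split()]

batch = [pad_or_truncate(seq, maxlen=6) for seq in (short, long)]
```

A larger `maxlen` drops less of each long review, at the cost of more padding and slower training.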


Word-based NN (CNN)

• 1D convolutions, max pooling, dropouts, dense

• Better to train your own embeddings

• Stemming, removing stopwords

• Works worse than LSTM

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-CNN.ipynb
https://github.com/udsclub/charlie-sentiment-analysis


Char-based NN

• Two approaches for preparing data:
    • One-hot encoding (OHE, 70 symbols)
    • Embeddings

• Two most popular approaches for training the model:
    • N*(Conv1d) + GlobalMaxPooling + Dense
    • N*(Conv1d + MaxPooling) + LSTM

• Using OHE or embeddings, we don’t need to store a big vocabulary

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-char-models.ipynb
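A minimal sketch of character-level one-hot encoding. The alphabet here is a made-up 42-symbol set, since the source does not list the 70 symbols it refers to; the function name and `maxlen` are also assumptions.

```python
import string

# Hypothetical alphabet; the slide's 70-symbol set is not listed in the source.
ALPHABET = string.ascii_lowercase + string.digits + " .,!?'"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_encode(text, maxlen):
    """Encode text as a maxlen x len(ALPHABET) matrix of 0/1 rows."""
    matrix = [[0] * len(ALPHABET) for _ in range(maxlen)]
    for pos, ch in enumerate(text.lower()[:maxlen]):
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:  # Unknown characters stay as all-zero rows.
            matrix[pos][idx] = 1
    return matrix


encoded = one_hot_encode("Bad!", maxlen=6)
```

The only lookup table is the alphabet itself, which is why character models avoid the large vocabulary that word-based models have to store.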


My ranking (small data)

1. Word-based LSTM

2. Linear models

3. Char-based CNN + LSTM

4. FastText

5. Word-based CNN

6. Boosting


My ranking (big data)

1. Word-based LSTM

2. Char-based CNN + LSTM

3. FastText

4. Linear models (log reg)

5. Word-based CNN

6. Boosting


Observations 1

• Small number of short reviews

  • Linear models with BoW (many n-grams and strong regularization)

• Small number of long reviews

  • One-layer LSTM with pre-trained Google word2vec and heavy dropout

• Many reviews

  • LSTM and char-CNN
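As a rough illustration of the "BoW with many n-grams" input for a linear model, the sketch below counts word n-grams in plain Python. The function name and the whitespace tokenization are assumptions made for this example. In practice a library vectorizer would likely be used, and the strong regularization would be set on the linear model itself.

```python
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count all word n-grams of length 1..max_n in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


features = ngram_counts("not good at all", max_n=2)
print(sorted(features))
```

Bigrams such as "not good" carry the negation that single words lose. The number of features grows quickly with `max_n`, which is one reason strong regularization helps on small datasets.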

Observations 2

• Begin with simple models: a one-layer LSTM or Logistic Regression

• Complex LSTMs (Bidirectional, Stacked, Merged, with Attention) usually do not work better than a simple LSTM

• LSTM with Attention assigns the largest weights to the last words

https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/test-attention.ipynb

Observations 3

• An imbalanced dataset leads to heavy overfitting on the smaller class on the test set

• Pay attention to the F1 score and the classification report

• If you have many reviews, just remove some samples from the bigger class

https://github.com/udsclub/alpha-sentiment-analysis/blob/master/amazonTv/scripts/validation_curves.ipynb
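The two pieces of advice above, undersampling the bigger class and checking F1 rather than accuracy alone, can be sketched in plain Python. The helper names and the toy data are assumptions made for this example and are not taken from the linked notebook.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many items as the smallest one."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 8 positive and 2 negative reviews.
reviews = ["r%d" % i for i in range(10)]
labels = [1] * 8 + [0] * 2
balanced = undersample(reviews, labels)

# A model that always predicts the majority class reaches 80% accuracy here,
# yet its F1 for the minority class is 0.
always_positive = [1] * len(labels)
minority_f1 = f1_score(labels, always_positive, positive=0)
```

Computing F1 per class, as a classification report does, exposes the failure on the smaller class that overall accuracy hides.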

Observations 4

• Predict other domains

  • Use the Amazon dataset (works well)

  • Trained on Amazon Movie and TV, 1.5 million reviews (LSTM; other models score more than 1% lower)

  • “Digital Music” – 95.82%, “Office Products” – 95.76%, “Video Games” – 94.08%

https://github.com/udsclub/alpha-sentiment-analysis/blob/master/amazonTv/scripts/validation_curves.ipynb

Challenges

• Enrich dataset with synonyms

  • synonymscrawler, word2vec (the closest vector by cosine distance), WordNet (works poorly)

• Transfer learning to other languages

  • Train on English and transfer to other languages with the same characters (works well)

https://github.com/udsclub/alpha-sentiment-analysis/tree/master/Enrichment%20dataset
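The word2vec variant of synonym enrichment (take the closest vector by cosine distance) can be sketched as follows. The toy 3-dimensional vectors, the helper names and the 0.9 similarity threshold are all invented for this example. Real word2vec vectors have hundreds of dimensions and would be loaded from a pre-trained model.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "table": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 - cosine distance)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_word(word, embeddings):
    """Return (neighbor, similarity) for the most similar other word."""
    query = embeddings[word]
    scored = (
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != word
    )
    return max(scored, key=lambda pair: pair[1])


def enrich(review, embeddings, min_similarity=0.9):
    """Build an extra sample by swapping words for sufficiently close neighbors."""
    out = []
    for tok in review.split():
        if tok in embeddings:
            neighbor, sim = closest_word(tok, embeddings)
            out.append(neighbor if sim >= min_similarity else tok)
        else:
            out.append(tok)
    return " ".join(out)


print(enrich("a good table", EMBEDDINGS))
```

The threshold keeps words without a close neighbor unchanged. With real embeddings, the nearest vector is not always a true synonym, so generated samples are worth spot-checking.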

Contact me

• OpenDataScience – @vradchenko

• Facebook – https://www.facebook.com/vitaliyradchenko127

• Email – [email protected]

• UDS Club – https://github.com/udsclub

Thank you