machine learning from movie reviews - long form

Machine Learning From Movie Reviews Jennifer Dunne

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Machine Learning From Movie Reviews - Long Form

Machine Learning From Movie Reviews

Jennifer Dunne

Page 2: Machine Learning From Movie Reviews - Long Form

Why Movie Reviews?

• Natural Language Processing is hot

• There are real world use cases

• It plays to my domain knowledge outside of data science

Page 3: Machine Learning From Movie Reviews - Long Form

Natural Language Processing Use Cases

What Are People Saying About Me?

Who Are These People?

Page 4: Machine Learning From Movie Reviews - Long Form

Sentiment Analysis

What Are People Saying About Me?

Good Bad

Page 5: Machine Learning From Movie Reviews - Long Form

Customer Segmentation

Who Are These People?

Animated Action Comedy Drama

Family Fantasy Horror Musical

Mystery Romance Sci-Fi Thriller

War Western

Page 6: Machine Learning From Movie Reviews - Long Form

Sentiment Analysis Testing

Data

• 100,000 movie reviews from IMDB

• Training set of 12,500 positive reviews (7-10 stars)

• Training set of 12,500 negative reviews (<5 stars)

• 30 or fewer reviews per movie

Methods

• Bag of words (Sklearn TF/IDF)

• Word2Vec (Gensim)

• Doc2Vec (Gensim)

• Pattern

• Indico (by sentence)

• Indico (by document)

• Indico (High Quality Sentiment)

Page 7: Machine Learning From Movie Reviews - Long Form

Speed

• How long does it take to train?

◦ Preparing text

◦ Training machine learning model

◦ Some models are pre-trained

• How quickly does it analyze?

◦ Preparing text

◦ Running text through the trained model

Page 8: Machine Learning From Movie Reviews - Long Form

How well does it do?

• Accuracy

◦Did we correctly name positive sentiment as positive?

◦Did we correctly name negative sentiment as negative?

◦Better for even class distribution

• F1 = 2 × Precision × Recall / (Precision + Recall)

◦ Precision = percent of things we called positive that were actually positive

◦ Recall = percent of things that were actually positive that we called positive

◦Better for uneven class distribution
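The two metrics on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the code behind the reported scores: the function name `classification_metrics` and the toy label lists are assumptions, and F1 is computed as the harmonic mean of precision and recall, which is its standard definition.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data (made up): an even class split, one miss in each direction.
balanced = classification_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                                  [1, 1, 1, 0, 1, 0, 0, 0])

# Toy data (made up): an uneven split where the model always says "negative".
skewed = classification_metrics([1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                                [0] * 10)

print(balanced)
print(skewed)
```

In the skewed case accuracy still looks respectable while F1 drops to zero, which is why F1 is the better guide when classes are uneven.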

Page 9: Machine Learning From Movie Reviews - Long Form

Bag Of Words (Sklearn TFIDF)

• Simple algorithm

• Fast to train (10 minutes)

• Fast to apply

• 85.3% accuracy

• 85.3% F1
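The slide names scikit-learn's TF/IDF but shows no code, so the following is a standard-library sketch of the weighting idea only. The whitespace tokenizer, the function names, and the three toy reviews are assumptions. The smoothed inverse document frequency and the L2 normalization follow the formula scikit-learn documents for its default settings. The classifier trained on top of these vectors is not shown because the slides do not name it.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed idf: terms found in every document get weight 1.0.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so long and short reviews are comparable.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf


# Toy reviews (made up for illustration).
reviews = [
    "a great great movie",
    "a terrible movie",
    "great acting and a great script",
]
vectors, idf = tfidf_vectors(reviews)
print(vectors[0])
```

Words that occur in every review, such as "a", receive the lowest weight, while rarer words such as "terrible" carry more signal.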

Page 10: Machine Learning From Movie Reviews - Long Form

Word2Vec (Gensim)

• More complex algorithm

• Computationally intensive

• Better results with larger training sets, multiple epochs

• Slow to train (2 hours)

• Slow to apply

• 81.9% accuracy

• 82.2% F1

Page 11: Machine Learning From Movie Reviews - Long Form

Doc2Vec (Gensim)

• More complex algorithm

• Computationally intensive

• Better results with larger training sets, multiple epochs

• Slow to train (4 hours)

• Slow to apply

• 82.8% accuracy

• 82.8% F1

• Scores above are for the distributed bag of words model

• (other Doc2Vec models reached 70% and 82% accuracy)

Page 12: Machine Learning From Movie Reviews - Long Form

Pattern (built-in)

• Simple algorithm

• Part of the Pattern module

• No training required

• Fast to apply

• 76.4% accuracy

• 76.9% F1

• (lowest scores)

Page 13: Machine Learning From Movie Reviews - Long Form

Indico (by sentence)

• More complex algorithm

• API calls to proprietary system

• No training required

• Fast to apply

• 89.1% accuracy

• 88.9% F1
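The difference between scoring a review by sentence and by document can be illustrated with a toy scorer. Everything here is a stand-in: the word lists, `score_text`, and the choice to average sentence scores are assumptions, since the slides do not describe how sentence-level results were combined, and no real Indico API call is shown.

```python
import re

# Hypothetical cue words; a real sentiment model is far richer than this.
POSITIVE = {"great", "good", "loved", "brilliant"}
NEGATIVE = {"bad", "boring", "terrible", "hated"}


def score_text(text):
    """Toy scorer: share of cue words that are positive, 0.5 if none found."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.5
    return pos / (pos + neg)


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace (rough heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_by_document(text):
    """Score the whole review in a single pass."""
    return score_text(text)


def score_by_sentence(text):
    """Score each sentence separately, then average (an assumed aggregation)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.5
    return sum(score_text(s) for s in sentences) / len(sentences)


review = ("The acting was great. The plot was boring and the ending "
          "was terrible. I loved the music.")
doc_score = score_by_document(review)
sent_score = score_by_sentence(review)
print(doc_score, sent_score)
```

The two approaches can disagree on the same review because a sentence with several cue words counts once when averaging by sentence but counts several times when scoring the whole document.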

Page 14: Machine Learning From Movie Reviews - Long Form

Indico (by document)

• Simple algorithm

• API calls to proprietary system

• No training required

• Fast to apply

• 90.1% accuracy

• 90.0% F1

Page 15: Machine Learning From Movie Reviews - Long Form

Indico (High Quality Sentiment)

• Simple algorithm

• API calls to proprietary system

• No training required

• Slow to apply

• 93.2% accuracy

• 93.2% F1

• (highest scores)

Page 16: Machine Learning From Movie Reviews - Long Form

[Figure: bar chart "Comparison of Sentiment Prediction", showing Accuracy and F1 on a 0 to 1 scale for Indico Sent_HQ, Indico by doc., Doc2Vec, Pattern, Indico by sent., Word2Vec, and Bag of Words]

Page 17: Machine Learning From Movie Reviews - Long Form

Customer Segmentation

Who Are These People?

Animated Action Comedy Drama

Family Fantasy Horror Musical

Mystery Romance Sci-Fi Thriller

War Western

Page 18: Machine Learning From Movie Reviews - Long Form

Customer Segmentation Testing

Data

• 129,809 movie reviews from IMDB

• 3,323 different movies

• 510 different combinations of genre

• 40 or fewer reviews per movie

Methodology

◦ Transform each review into an Indico document vector

◦ Test success of different document criteria

◦ Test success of different models

◦ Optimize best model

Page 19: Machine Learning From Movie Reviews - Long Form

What Is Random Chance?

• 510 different genre combinations

• Heavily weighted to negative (most movies do not belong to any given genre)

• F1 random chance < 20%

Animated Action Comedy Drama

Family Fantasy Horror Musical

Mystery Romance Sci-Fi Thriller

War Western
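The slide does not say how the random-chance figure was derived. One way to see why it is low is the large-sample behaviour of a guesser that ignores the review entirely: its precision approaches the genre's prevalence and its recall approaches its own rate of saying "yes". The function name, the 50% guess rate, and the 10% prevalence below are illustrative assumptions, not values from the deck.

```python
def expected_random_f1(prevalence, guess_rate=0.5):
    """Large-sample F1 of a guesser that says "positive" at random.

    prevalence: share of examples that truly carry the label.
    guess_rate: probability the guesser says "positive", independent of truth.
    """
    precision = prevalence   # random positives are right at the base rate
    recall = guess_rate      # a random share of true positives is caught
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical genre that applies to 10% of movies.
rare_genre_f1 = expected_random_f1(0.10)
print(round(rare_genre_f1, 3))
```

Under these assumptions, a genre that applies to few movies yields a random-guess F1 well under 20%, so scores in the 0.5 to 0.7 range represent a large gain over chance.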

Page 20: Machine Learning From Movie Reviews - Long Form

More Reviews or More Words?

Number of Reviews    Word Length    F1 Score
129,809              All Reviews    .54
72,691               200+           .56
19,914               500+           .57

Conclusion: Longer reviews work better
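A length filter of the kind this comparison implies might look like the sketch below. It assumes "200+" and "500+" mean a minimum word count and that words are split on whitespace. The helper name `filter_by_min_words` and the sample reviews are hypothetical. The subset sizes are the ones on the slide.

```python
def filter_by_min_words(reviews, min_words):
    """Keep only reviews with at least min_words whitespace-separated words."""
    return [r for r in reviews if len(r.split()) >= min_words]


# Tiny made-up sample to show the filter in action.
sample = ["Loved it.", "A slow start but a strong finish overall.", "Bad."]
long_enough = filter_by_min_words(sample, 5)

# Subset sizes reported on the slide.
subset_sizes = {"all": 129_809, "200+": 72_691, "500+": 19_914}
retained_share = {name: size / subset_sizes["all"]
                  for name, size in subset_sizes.items()}

for name, share in retained_share.items():
    print(f"{name}: {share:.1%} of reviews kept")
```

The 500+ subset keeps roughly 15% of the reviews yet scores highest, which supports the slide's point that review length matters more than review count.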

Page 21: Machine Learning From Movie Reviews - Long Form

Optimizing Models

Model                                          F1 Score
Tuned Random Forest                            .57
Initial Logistic Regression, Initial Linear SVC  .62
Tuned Logistic Regression, Tuned Linear SVC    .63
Initial Gradient Boost                         .63
Tuned Gradient Boost                           .67

Conclusion: Choosing the right model matters more than tuning
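The conclusion can be checked with simple arithmetic on the reported F1 scores. The dictionary layout and variable names are assumptions made for this sketch, and the scores are copied from the slide. No initial score is given for Random Forest, so it has no tuning gain.

```python
# F1 scores as reported on the slide, keyed by (model, stage).
f1_scores = {
    ("Random Forest", "tuned"): 0.57,
    ("Logistic Regression", "initial"): 0.62,
    ("Linear SVC", "initial"): 0.62,
    ("Logistic Regression", "tuned"): 0.63,
    ("Linear SVC", "tuned"): 0.63,
    ("Gradient Boost", "initial"): 0.63,
    ("Gradient Boost", "tuned"): 0.67,
}

# Gain from tuning, for models that have both an initial and a tuned score.
tuning_gain = {}
for (model, stage), score in f1_scores.items():
    if stage == "tuned" and (model, "initial") in f1_scores:
        tuning_gain[model] = round(score - f1_scores[(model, "initial")], 2)

# Spread between the best and worst tuned models.
tuned = {model: score for (model, stage), score in f1_scores.items()
         if stage == "tuned"}
model_choice_gap = round(max(tuned.values()) - min(tuned.values()), 2)

print("Gain from tuning:", tuning_gain)
print("Gap between tuned models:", model_choice_gap)
```

The largest tuning gain is four points, while the gap between the best and worst tuned models is ten points.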

Page 22: Machine Learning From Movie Reviews - Long Form

More From Customer Segmentation

[Diagram: Genre, Review 1 to Review 3, Feature 1 to Feature 3, Genre]

Page 23: Machine Learning From Movie Reviews - Long Form

Some Customer Segments Overlap

Animated Family

Both: 1297 reviews Animated: 395 reviews Family: 1090 reviews
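The size of this overlap can be quantified from the three counts. The sketch assumes the "Animated" and "Family" figures count reviews in that genre only, as the regions of a Venn diagram would, which the transcript does not confirm. The variable names are illustrative.

```python
# Counts from the slide; "only" interpretation is an assumption.
both = 1297
animated_only = 395
family_only = 1090

animated_total = both + animated_only
family_total = both + family_only

# Jaccard similarity: shared reviews over all reviews in either genre.
jaccard = both / (both + animated_only + family_only)

# Share of each genre's reviews that also carry the other genre.
animated_also_family = both / animated_total
family_also_animated = both / family_total

print(f"Jaccard overlap: {jaccard:.2f}")
print(f"Animated reviews that are also Family: {animated_also_family:.0%}")
print(f"Family reviews that are also Animated: {family_also_animated:.0%}")
```

Under that reading, most Animated reviews are also Family reviews, so a model will struggle to separate the two segments.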

Page 24: Machine Learning From Movie Reviews - Long Form

Segments Care About Different Things

Horror War

Page 25: Machine Learning From Movie Reviews - Long Form

Machine Learning From Movie Reviews

• See the complete set of word clouds at:

◦ Github/JenniferDunne

• Contact:

◦ [email protected]

◦ Linkedin/jenniferdunneco