machine learning from movie reviews - long form
TRANSCRIPT
Machine Learning From Movie Reviews
Jennifer Dunne
Why Movie Reviews?
• Natural Language Processing is hot
• There are real world use cases
• It plays to my domain knowledge outside of data science
Natural Language Processing Use Cases
What Are People Saying About Me?
Who Are These People?
Sentiment Analysis
What Are People Saying About Me?
Good, Bad
Customer Segmentation
Who Are These People?
Animated Action Comedy Drama
Family Fantasy Horror Musical
Mystery Romance Sci-Fi Thriller
War Western
Sentiment Analysis Testing
Data
• 100,000 movie reviews from IMDB
• Training set of 12,500 positive reviews (7-10 stars)
• Training set of 12,500 negative reviews (<5 stars)
• 30 or fewer reviews per movie
Methods
• Bag of words (Sklearn TF-IDF)
• Word2Vec (Gensim)
• Doc2Vec (Gensim)
• Pattern
• Indico (by sentence)
• Indico (by document)
• Indico (High Quality Sentiment)
Speed
• How long does it take to train?
◦ Preparing text
◦ Training machine learning model
◦ Some models are pre-trained
• How quickly does it analyze?
◦ Preparing text
◦ Running text through the trained model
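The two costs listed above (preparing text, then running it through a model) can be timed separately. The sketch below is a minimal illustration using only the standard library; `prepare` and `predict` are hypothetical stand-ins, not the code used in the talk.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def prepare(texts):
    # Hypothetical text preparation: lowercase and split on whitespace.
    return [t.lower().split() for t in texts]

def predict(token_lists):
    # Hypothetical stand-in for a trained model: positive if "good" appears.
    return [1 if "good" in tokens else 0 for tokens in token_lists]

reviews = ["A good movie", "Terrible pacing and a bad script"]
tokens, prep_seconds = timed(prepare, reviews)
labels, predict_seconds = timed(predict, tokens)
print(f"prepare: {prep_seconds:.6f}s, predict: {predict_seconds:.6f}s")
```

Timing the stages separately shows whether the cost sits in text preparation or in the model itself.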
How well does it do?
• Accuracy
◦ Did we correctly name positive sentiment as positive?
◦ Did we correctly name negative sentiment as negative?
◦ Better for even class distribution
• F1 = 2 × Precision × Recall / (Precision + Recall)
◦ Precision = percent of things we called positive that were actually positive
◦ Recall = percent of things that were actually positive that we called positive
◦ Better for uneven class distribution
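As a worked example of these metrics: F1 is conventionally the harmonic mean of precision and recall, 2PR / (P + R). The sketch below computes accuracy, precision, recall and F1 from scratch on a small made-up label set with an uneven class split; the labels are illustrative, not data from the talk.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up example: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(acc, precision, recall, f1)
```

Here accuracy comes out well above F1 because the many easy negatives dominate the count, which is why F1 is the better guide when classes are uneven.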
Bag Of Words (Sklearn TF-IDF)
• Simple algorithm
• Fast to train (10 minutes)
• Fast to apply
• 85.3% accuracy
• 85.3% F1
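To show what a TF-IDF bag of words does, here is a minimal standard-library sketch using the textbook formula (term frequency times log of inverse document frequency). It is not Sklearn's implementation: Sklearn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The three documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["great film great cast", "dull film", "film was fun"]
weights = tfidf(docs)
for doc, vec in zip(docs, weights):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

A word that appears in every document ("film" here) gets zero weight, while words concentrated in one review carry the signal a classifier can learn from.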
Word2Vec (Gensim)
• More complex algorithm
• Computationally intensive
• Better results with larger training sets, multiple epochs
• Slow to train (2 hours)
• Slow to apply
• 81.9% accuracy
• 82.2% F1
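Word2Vec produces one vector per word, so a review has to be turned into a single feature vector before it can be classified. The slides do not say how this was done; averaging the word vectors is one common option, sketched below with a tiny made-up embedding table (`TOY_EMBEDDINGS` is hypothetical, not Gensim output).

```python
# Hypothetical 2-dimensional embeddings, standing in for trained Word2Vec vectors.
TOY_EMBEDDINGS = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}

def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; zeros if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

doc_vector = average_vector(["good", "movie", "unseen"], TOY_EMBEDDINGS, dim=2)
print(doc_vector)
```

The resulting document vectors can then be fed to any ordinary classifier; out-of-vocabulary words are simply skipped in this sketch.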
Doc2Vec (Gensim)
• More complex algorithm
• Computationally intensive
• Better results with larger training sets, multiple epochs
• Slow to train (4 hours)
• Slow to apply
• 82.8% accuracy
• 82.8% F1
• Scores shown are for the distributed bag of words model
• (other Doc2Vec models reached 70% and 82% accuracy)
Pattern (built-in)
• Simple algorithm
• Part of the Pattern module
• No training required
• Fast to apply
• 76.4% accuracy
• 76.9% F1
• (lowest scores)
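A sentiment tool that needs no training typically relies on a hand-built lexicon of word polarities. The sketch below is a toy illustration of that general idea only; it is not Pattern's code, and `LEXICON` with its scores is made up.

```python
# Made-up polarity lexicon: scores run from -1 (negative) to +1 (positive).
LEXICON = {"great": 0.8, "good": 0.7, "dull": -0.5, "terrible": -1.0}

def polarity(text):
    """Mean polarity of the words found in the lexicon; 0.0 if none match."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def label(text, threshold=0.0):
    """Call a text positive when its polarity exceeds the threshold."""
    return "positive" if polarity(text) > threshold else "negative"

print(label("a great cast"), label("dull and terrible"))
```

Because nothing is learned from the reviews themselves, an approach like this is fast to apply but cannot adapt to domain-specific vocabulary.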
Indico (by sentence)
• More complex algorithm
• API calls to proprietary system
• No training required
• Fast to apply
• 89.1% accuracy
• 88.9% F1
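Scoring "by sentence" means splitting a review, scoring each sentence, and combining the results. The slides do not say how the sentence scores were combined; the sketch below assumes a simple mean and a scorer that returns a value between 0 and 1. `toy_score` is a hypothetical stand-in for the remote API call, not Indico's interface.

```python
import re

def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def score_by_sentence(text, score_fn):
    """Mean of per-sentence scores; 0.5 (neutral) for empty text."""
    scores = [score_fn(s) for s in split_sentences(text)]
    return sum(scores) / len(scores) if scores else 0.5

def toy_score(sentence):
    # Hypothetical scorer standing in for an API call (0 = negative, 1 = positive).
    s = sentence.lower()
    if "loved" in s:
        return 0.9
    if "hated" in s:
        return 0.1
    return 0.5

review = "I loved the cast. I hated the ending. The music was fine."
print(score_by_sentence(review, toy_score))
```

Averaging can wash out a review whose sentences disagree, which is one possible reason document-level scoring may do better.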
Indico (by document)
• Simple algorithm
• API calls to proprietary system
• No training required
• Fast to apply
• 90.1% accuracy
• 90.0% F1
Indico (High Quality Sentiment)
• Simple algorithm
• API calls to proprietary system
• No training required
• Slow to apply
• 93.2% accuracy
• 93.2% F1
• (highest scores)
[Chart: Comparison of Sentiment Prediction. Accuracy and F1 (axis 0 to 1) for Indico Sent_HQ, Indico by doc., Doc2Vec, Pattern, Indico by sent., Word2Vec, Bag of Words]
Customer Segmentation
Who Are These People?
Animated Action Comedy Drama
Family Fantasy Horror Musical
Mystery Romance Sci-Fi Thriller
War Western
Customer Segmentation Testing
Data
• 129,809 movie reviews from IMDB
• 3,323 different movies
• 510 different combinations of genre
• 40 or fewer reviews per movie
Methodology
◦ Transform each review into an Indico document vector
◦ Test success of different document criteria
◦ Test success of different models
◦ Optimize best model
What Is Random Chance?
• 510 different genre combinations
• Heavily weighted to negative
• F1 random chance < 20%
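One way to reason about a random-chance F1 when labels are mostly negative: if a fraction p of labels are truly positive and a guesser marks each label positive with probability q, then in a large sample precision approaches p and recall approaches q. The sketch below turns that into an approximate F1. It reads "weighted to negative" as "most reviews do not carry any given genre label", which is an assumption, and the rates used are illustrative, not figures from the talk.

```python
def approx_random_f1(positive_rate, guess_rate):
    """Large-sample F1 of independent random guessing.

    positive_rate: fraction of labels that are truly positive (p)
    guess_rate: probability the guesser says positive (q)
    """
    precision = positive_rate   # a random positive guess is right p of the time
    recall = guess_rate         # each true positive is caught q of the time
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Illustrative rates only: 10% positives, coin-flip guessing vs. always-positive.
coin_flip = approx_random_f1(0.10, 0.5)
always_positive = approx_random_f1(0.10, 1.0)
print(round(coin_flip, 3), round(always_positive, 3))
```

With rare positives, even the always-positive guesser stays at a low F1, which is the baseline a real model has to beat.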
More Reviews or More Words?
Number of Reviews    Word Length    F1 Score
129,809              All Reviews    .54
72,691               200+           .56
19,914               500+           .57
Conclusion: Longer reviews work better
Optimizing Models
Model                                              F1 Score
Tuned Random Forest                                .57
Initial Logistic Regression, Initial Linear SVC    .62
Tuned Logistic Regression, Tuned Linear SVC        .63
Initial Gradient Boost                             .63
Tuned Gradient Boost                               .67
Conclusion: Choosing the right model matters more than tuning
More From Customer Segmentation
[Diagram labels: Genre; Review 1, Review 2, Review 3; Feature 1, Feature 2, Feature 3; Genre]
Some Customer Segments Overlap
Animated and Family
• Both: 1,297 reviews
• Animated: 395 reviews
• Family: 1,090 reviews
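The size of this overlap can be put in proportion with a little arithmetic. The sketch below assumes the "Animated" and "Family" counts are exclusive (reviews carrying that genre but not the other), which the slide does not state outright.

```python
# Counts from the slide; assumed to be mutually exclusive groups.
both = 1297
animated_only = 395
family_only = 1090

animated_total = both + animated_only
family_total = both + family_only

# Share of each segment that also belongs to the other.
animated_also_family = both / animated_total
family_also_animated = both / family_total

# Jaccard similarity: overlap divided by the union of the two segments.
jaccard = both / (both + animated_only + family_only)

print(f"Animated reviews that are also Family: {animated_also_family:.1%}")
print(f"Family reviews that are also Animated: {family_also_animated:.1%}")
print(f"Jaccard similarity: {jaccard:.3f}")
```

Under that assumption, most Animated reviews are also Family reviews, so a classifier has little to separate the two segments.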
Segments Care About Different Things
[Word clouds: Horror, War]
Machine Learning From Movie Reviews
• See the complete set of word clouds at:
◦ Github/JenniferDunne
• Contact:
◦ [email protected]
◦ Linkedin/jenniferdunneco