machine learning from movie reviews - long form

Post on 13-Apr-2017


Machine Learning From Movie Reviews

Jennifer Dunne

Why Movie Reviews?

• Natural Language Processing is hot

• There are real world use cases

• It plays to my domain knowledge outside of data science

Natural Language Processing Use Cases

• What are people saying about me?

• Who are these people?

Sentiment Analysis

What are people saying about me?

• Good

• Bad

Customer Segmentation

Who are these people?

Genres: Animated, Action, Comedy, Drama, Family, Fantasy, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

Sentiment Analysis Testing

Data

• 100,000 movie reviews from IMDB

• Training set of 12,500 positive reviews (7-10 stars)

• Training set of 12,500 negative reviews (<5 stars)

• 30 or fewer reviews per movie

Methods

• Bag of words (Sklearn TF/IDF)

• Word2Vec (Gensim)

• Doc2Vec (Gensim)

• Pattern

• Indico (by sentence)

• Indico (by document)

• Indico (High Quality Sentiment)

Speed

• How long does it take to train?

◦ Preparing text

◦ Training machine learning model

◦ Some models are pre-trained

• How quickly does it analyze?

◦ Preparing text

◦ Running text through the trained model

How well does it do?

• Accuracy

◦ Did we correctly name positive sentiment as positive?

◦ Did we correctly name negative sentiment as negative?

◦ Better for even class distribution

• F1 = 2 × Precision × Recall / (Precision + Recall)

◦ Precision = percent of things we called positive that were actually positive

◦ Recall = percent of things that were actually positive that we called positive

◦ Better for uneven class distribution
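To make the two metrics concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 from a list of true and predicted labels, using the standard harmonic-mean definition of F1. The function names and the toy labels are illustrative assumptions, not taken from the talk.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 (harmonic mean) as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
result = scores(y_true, y_pred)
print(result)
```

With evenly balanced classes, as in this toy example, accuracy and F1 tend to land close together, which matches the near-identical accuracy and F1 figures reported for each method.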

Bag Of Words (Sklearn TFIDF)

• Simple algorithm

• Fast to train (10 minutes)

• Fast to apply

• 85.3% accuracy

• 85.3% F1
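A bag-of-words sentiment classifier of this kind might look like the sketch below. The slides name Sklearn TF/IDF but not the classifier trained on top of it, so the logistic regression here is an assumption, and the four training sentences are made-up stand-ins for the IMDB reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 1 = positive review, 0 = negative review.
train_texts = [
    "a wonderful film with great acting",
    "great fun and a wonderful cast",
    "a dull film with terrible acting",
    "terrible pacing and a dull cast",
]
train_labels = [1, 1, 0, 0]

# TF-IDF turns each review into a weighted word-count vector;
# the classifier (assumed here) learns which words signal each class.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(["wonderful great fun", "dull terrible pacing"])
print(predictions)
```

Wrapping both steps in one pipeline keeps the vocabulary learned at training time tied to the classifier, so new text is prepared the same way when the model is applied.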

Word2Vec (Gensim)

• More complex algorithm

• Computationally intensive

• Better results with larger training sets, multiple epochs

• Slow to train (2 hours)

• Slow to apply

• 81.9% accuracy

• 82.2% F1

Doc2Vec (Gensim)

• More complex algorithm

• Computationally intensive

• Better results with larger training sets, multiple epochs

• Slow to train (4 hours)

• Slow to apply

• 82.8% accuracy

• 82.8% F1

• Distributed bag of words

• (other models 70% and 82% accuracy rates)

Pattern (built-in)

• Simple algorithm

• Part of the Pattern module

• No training required

• Fast to apply

• 76.4% accuracy

• 76.9% F1

• (lowest scores)

Indico (by sentence)

• More complex algorithm

• API calls to proprietary system

• No training required

• Fast to apply

• 89.1% accuracy

• 88.9% F1

Indico (by document)

• Simple algorithm

• API calls to proprietary system

• No training required

• Fast to apply

• 90.1% accuracy

• 90.0% F1

Indico (High Quality Sentiment)

• Simple algorithm

• API calls to proprietary system

• No training required

• Slow to apply

• 93.2% accuracy

• 93.2% F1

• (highest scores)

Comparison of Sentiment Prediction

[Bar chart: Accuracy and F1, on a scale of 0 to 1, for Indico Sent_HQ, Indico by doc., Doc2Vec, Pattern, Indico by sent., Word2Vec, and Bag of Words]

Customer Segmentation

Who are these people?

Genres: Animated, Action, Comedy, Drama, Family, Fantasy, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

Customer Segmentation Testing

Data

• 129,809 movie reviews from IMDB

• 3,323 different movies

• 510 different combinations of genre

• 40 or fewer reviews per movie

Methodology

◦ Transform each review into an Indico document vector

◦ Test success of different document criteria

◦ Test success of different models

◦ Optimize best model
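The methodology above can be sketched as a multi-label classification problem: each review is a vector, and each genre is a yes/no label. The sketch below uses random vectors as a stand-in for the Indico document vectors and three made-up genres, and it assumes a one-vs-rest logistic regression scored with micro-averaged F1; the slides do not say how F1 was averaged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Stand-in for document vectors: 400 hypothetical reviews, 20 dimensions each.
X = rng.normal(size=(400, 20))

# Stand-in genre labels: 3 hypothetical genres, each driven by one dimension.
# A review can carry several genres at once (a multi-label target).
Y = (X[:, :3] > 0.5).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# One binary classifier per genre, trained on the same document vectors.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)
micro_f1 = f1_score(Y_test, Y_pred, average="micro")
print(f"micro-averaged F1: {micro_f1:.2f}")
```

Swapping `LogisticRegression` for another estimator is a one-line change, which is how the different models in the later comparison could be tested on the same vectors.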

What Is Random Chance?

• 510 different genre combinations

• Heavily weighted to negative

• F1 random chance < 20%

Genres: Animated, Action, Comedy, Drama, Family, Fantasy, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western
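One way to see why labels heavily weighted to negative push random-chance F1 down: a guesser that says "yes" at random has precision roughly equal to the share of true positives and recall roughly equal to its own guess rate. The sketch below is an approximation under that assumption, not the calculation behind the figure on the slide, and the prevalence values are illustrative.

```python
def coin_flip_f1(prevalence, guess_rate=0.5):
    """Approximate F1 of a guesser that marks labels positive at random.

    prevalence: fraction of labels that are truly positive.
    guess_rate: fraction of labels the guesser marks positive.
    """
    precision = prevalence  # random "yes" guesses hit positives at the base rate
    recall = guess_rate     # a random guesser catches this share of positives
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical prevalence values: the rarer the positives, the lower the F1.
for p in (0.5, 0.2, 0.1, 0.05):
    print(f"prevalence {p:.2f}: random-chance F1 about {coin_flip_f1(p):.2f}")
```

With most reviews belonging to only a few genres, positives are rare for any single genre, so a random baseline scores well below the F1 values reported for the trained models.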

More Reviews or More Words?

Number of Reviews    Word Length    F1 Score
129,809              All Reviews    .54
72,691               200+           .56
19,914               500+           .57

Conclusion: Longer reviews work better
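Selecting the longer reviews could be done with a simple word-count filter like the sketch below. It assumes "Word Length" means the number of whitespace-separated words in a review; the function name and sample reviews are hypothetical.

```python
def filter_by_word_count(reviews, min_words):
    """Keep only reviews with at least min_words whitespace-separated words."""
    return [text for text in reviews if len(text.split()) >= min_words]


# Hypothetical sample reviews; the talk used thresholds of 200 and 500 words.
reviews = [
    "Loved it.",
    "A slow start, but the final act pays off nicely.",
    "Too long.",
]
long_reviews = filter_by_word_count(reviews, min_words=5)
print(long_reviews)
```

The trade-off shown in the table is that each higher threshold leaves fewer reviews to train on, yet the F1 score still improved.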

Optimizing Models

Model                                                    F1 Score
Tuned Random Forest                                      .57
Initial Logistic Regression, Initial Linear SVC          .62
Tuned Logistic Regression, Tuned Linear SVC              .63
Initial Gradient Boost                                   .63
Tuned Gradient Boost                                     .67

Conclusion: Choosing the right model matters more than tuning

More From Customer Segmentation

[Diagram with labels: Genre; Review 1, Review 2, Review 3; Feature 1, Feature 2, Feature 3; Genre]

Some Customer Segments Overlap

Animated and Family

• Both: 1297 reviews

• Animated: 395 reviews

• Family: 1090 reviews
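Overlap counts like these can be tallied from per-review genre sets. The sketch below assumes the single-genre figures count reviews carrying only that one genre of the pair, which the slide does not state; the function name and toy data are hypothetical.

```python
def overlap_counts(review_genres, genre_a, genre_b):
    """Count reviews tagged with both genres, only the first, or only the second."""
    both = only_a = only_b = 0
    for genres in review_genres:
        has_a = genre_a in genres
        has_b = genre_b in genres
        if has_a and has_b:
            both += 1
        elif has_a:
            only_a += 1
        elif has_b:
            only_b += 1
    return {"both": both, genre_a: only_a, genre_b: only_b}


# Hypothetical toy data: one set of genres per review.
review_genres = [
    {"Animated", "Family"},
    {"Animated", "Family", "Comedy"},
    {"Animated", "Action"},
    {"Family", "Drama"},
    {"Horror"},
]
counts = overlap_counts(review_genres, "Animated", "Family")
print(counts)
```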

Segments Care About Different Things

[Word clouds: Horror, War]

Machine Learning From Movie Reviews

• See the complete set of word clouds at:

◦ Github/JenniferDunne

• Contact:

◦ Jennifer.dunne.co@gmail.com

◦ Linkedin/jenniferdunneco
