Comparing Machine Learning Algorithms in Text Mining

Sentiment Analysis & Opinion Mining Project Work

Comparing Models for Review Classification, Counting Stars, Sentiment Quantification and Fake Review Detection

Andrea Gigli

https://about.me/andrea.gigli

The goal

Comparing different Machine Learning algorithms on different Text Mining tasks

Tasks considered:

1) Classifying Positive and Negative Reviews

2) Predicting Review Stars

3) Quantifying Sentiment Over Time

4) Detecting Fake Reviews

Tools:

Python + NLTK + Scikit-learn

ML Models: Naïve Bayes

Naïve Bayes is a probabilistic learner that uses Bayes' Theorem:

$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$

making a strong independence assumption between the features:

$P(c \mid d) \propto P(c) \prod_i P(f_i \mid c)$
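A minimal sketch of the idea in pure Python, assuming binary (present/absent) features as in a Bernoulli Naïve Bayes and Laplace smoothing. The toy documents, the labels and the function names are illustrative assumptions, not the code used for the experiments.

```python
import math

def train_bernoulli_nb(docs, labels):
    """docs: list of sets of features; labels: list of class names."""
    vocab = set().union(*docs)
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / len(docs)
        # Laplace smoothing (an assumption) keeps every probability in (0, 1).
        cond[c] = {
            f: (sum(f in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for f in vocab
        }
    return vocab, prior, cond

def predict(model, doc):
    """Return the class maximizing log P(c) + sum_i log P(f_i | c)."""
    vocab, prior, cond = model
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for f in vocab:
            p = cond[c][f]
            # Independence assumption: each feature contributes separately.
            score += math.log(p if f in doc else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only.
toy_docs = [{"great", "fun"}, {"great", "plot"}, {"boring", "plot"}, {"boring", "slow"}]
toy_labels = ["pos", "pos", "neg", "neg"]
model = train_bernoulli_nb(toy_docs, toy_labels)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.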

ML Models: SVM

Support Vector Machine (SVM) is a geometric learner that represents the set of features F in a |F|-dimensional vector space.

Each document d is represented by a vector of weights $w_f$, which indicate the relevance of feature f in document d.

The algorithm computes the hyperplane

$\mathbf{w} \cdot \mathbf{x} - b = 0$

that best separates the examples.
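The sketch below only shows how a separating hyperplane is used once it is known: a document vector is classified by the side of the hyperplane it falls on. It does not train an SVM, and the values of `w` and `b` are made up for illustration.

```python
def decision_value(w, x, b):
    """Signed score of x with respect to the hyperplane w . x - b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def classify(w, x, b):
    """Return +1 on the positive side of the hyperplane (or on it), else -1."""
    return 1 if decision_value(w, x, b) >= 0 else -1

# Hypothetical hyperplane in a 2-feature space.
w = [1.0, -1.0]
b = 0.5
```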

ML Models: Decision Trees

The Decision Tree algorithm generates a tree of yes/no questions on features.

It performs feature selection by maximizing an Information Gain measure:

$IG = H(C) - H(C \mid f)$
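As an illustration, Information Gain can be computed as the entropy of the class labels minus the entropy that remains after splitting on a feature. This is a small sketch with hypothetical function names; entropy is measured in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the conditional entropy given the feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for v, y in zip(feature_values, labels) if v == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional
```

A feature that splits the classes perfectly yields the full label entropy as gain, while a feature unrelated to the class yields zero.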

ML Models: Random Forest

Random forests are an ensemble learning method.

They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees.
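The voting step can be sketched as follows, assuming each tree has already produced a label for the same document. The function name is hypothetical, and ties go to whichever label was counted first.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most frequent label among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]
```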

ML Models: Adaptive Boosting

Adaptive Boosting is a meta-algorithm which can be used in conjunction with other types of learning algorithms to improve their performance.

The output of “weak learners” is combined into a weighted sum that represents the final output of the boosted classifier.

It is “Adaptive” in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers.
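One round of the re-weighting can be sketched as below, following the classic discrete AdaBoost update. This is an assumption about the variant, since the slides do not say which one was used, and it requires a weighted error strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """Return the weak learner's vote weight and the updated instance weights."""
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified instances gain weight, correctly classified ones lose it.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]

# Hypothetical round: four instances, the last one is misclassified.
alpha, new_weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
```

The returned `alpha` is the weight of this weak learner in the final weighted sum.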

(1) Classifying Reviews

We want to classify a Review as Positive or Negative

The data contain movie reviews labeled as Positive or Negative, available here:

http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz (set1)

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz (set2)

10-fold cross-validation (k = 10) is applied.

A separate comparison was performed by introducing lexicon features through SentiWordNet.
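The k-fold protocol can be sketched in a few lines: the data are split into k folds, and each fold is used once as the test set while the remaining folds form the training set. This is a simplified illustration without shuffling or stratification, not the routine used for the reported results.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    # Item i goes to fold i % k, so fold sizes differ by at most one.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each document appears in exactly one test fold, so the per-fold counts can be summed into a single contingency table.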

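Measuring Model Performance for task (1)

Predicted labels are compared to the true labels of the test set, and a contingency table is built:

| | Actual Positive | Actual Negative | Total |
|---|---|---|---|
| Predicted Positive | TP (True Positive) | FP (False Positive) | P* (Predicted Positive) |
| Predicted Negative | FN (False Negative) | TN (True Negative) | N* (Predicted Negative) |
| Total | P (Total Positive) | N (Total Negative) | D (Total Documents) |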


Measuring Model Performance for task (1)

• Accuracy: $\frac{TP + TN}{D}$

• Recall, ability to find positive documents: $\frac{TP}{P}$

• Precision, accuracy on documents predicted positive: $\frac{TP}{P^*}$

• F1, harmonic mean of precision and recall: $\frac{2TP}{2TP + FP + FN}$
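The four measures follow directly from the cells of the contingency table. A small sketch, with a hypothetical function name and made-up counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from contingency counts."""
    total = tp + fp + fn + tn          # D, total documents
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # TP / P
        "precision": tp / (tp + fp),   # TP / P*
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```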

ML in Review Classification (set 1)

| Model | Kfold Accuracy | Kfold Precision (TP/P*) | Kfold Recall (TP/P) | Kfold F1 |
|---|---|---|---|---|
| NaiveBayes (Bernoulli) | 0.863 (1726/2000) | 0.873 (850/974) | 0.850 (850/1000) | 0.861 (1700/1974) |
| SVM (Linear) | 0.853 (1705/2000) | 0.844 (865/1025) | 0.865 (865/1000) | 0.854 (1730/2025) |
| DecisionTree | 0.622 (1243/2000) | 0.631 (586/929) | 0.586 (586/1000) | 0.608 (1172/1929) |
| RandomForest | 0.713 (1425/2000) | 0.792 (576/727) | 0.576 (576/1000) | 0.667 (1152/1727) |
| AdaBoost | 0.758 (1516/2000) | 0.775 (727/938) | 0.727 (727/1000) | 0.750 (1454/1938) |

With Lexicon Features

| Model | Kfold Accuracy | Kfold Precision (TP/P*) | Kfold Recall (TP/P) | Kfold F1 |
|---|---|---|---|---|
| NaiveBayes (Bernoulli) | 0.850 (1699/2000) | 0.871 (820/941) | 0.820 (820/1000) | 0.845 (1640/1941) |
| SVM (Linear) | 0.836 (1671/2000) | 0.834 (837/1003) | 0.837 (837/1000) | 0.836 (1674/2003) |
| DecisionTree | 0.657 (1315/2000) | 0.658 (655/995) | 0.655 (655/1000) | 0.657 (1310/1995) |
| RandomForest | 0.720 (1439/2000) | 0.785 (604/769) | 0.604 (604/1000) | 0.683 (1208/1769) |
| AdaBoost | 0.764 (1528/2000) | 0.757 (778/1028) | 0.778 (778/1000) | 0.767 (1556/2028) |

ML in Review Classification (set 2)


| Model | Kfold Accuracy | Kfold Precision (TP/P*) | Kfold Recall (TP/P) | Kfold F1 |
|---|---|---|---|---|
| NaiveBayes (Bernoulli) | 0.825 (20625/25000) | 0.872 (9529/10933) | 0.762 (9529/12500) | 0.813 (19058/23433) |
| SVM (Linear) | 0.876 (21908/25000) | 0.885 (10814/12220) | 0.865 (10814/12500) | 0.875 (21628/24720) |
| DecisionTree | 0.708 (17712/25000) | 0.713 (8711/12210) | 0.697 (8711/12500) | 0.705 (17422/24710) |
| RandomForest | 0.756 (18898/25000) | 0.801 (8521/10644) | 0.682 (8521/12500) | 0.736 (17042/23144) |
| AdaBoost | 0.801 (20018/25000) | 0.785 (10365/13212) | 0.829 (10365/12500) | 0.806 (20730/25712) |

With Lexicon Features

| Model | Kfold Accuracy | Kfold Precision (TP/P*) | Kfold Recall (TP/P) | Kfold F1 |
|---|---|---|---|---|
| NaiveBayes (Bernoulli) | 0.825 (20625/25000) | 0.872 (9530/10935) | 0.762 (9530/12500) | 0.813 (19060/23435) |
| SVM (Linear) | 0.881 (22015/25000) | 0.891 (10840/12165) | 0.867 (10840/12500) | 0.879 (21680/24665) |
| DecisionTree | 0.704 (17589/25000) | 0.705 (8745/12401) | 0.700 (8745/12500) | 0.702 (17490/24901) |
| RandomForest | 0.757 (18924/25000) | 0.814 (8332/10240) | 0.667 (8332/12500) | 0.733 (16664/22740) |
| AdaBoost | 0.804 (20096/25000) | 0.780 (10581/13566) | 0.846 (10581/12500) | 0.812 (21162/26066) |

(2) Predicting Review Stars

We want to predict the score associated with a review.

The data contain ratings (from 1 to 5) and reviews from Amazon and TripAdvisor, available at:

http://patty.isti.cnr.it/~baccianella/reviewdata/corpus/Amazon_corpus.zip (set 1)

http://patty.isti.cnr.it/~baccianella/reviewdata/corpus/TripAdvisor_corpus.zip (set 2)

We used bigrams as an additional feature.
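To illustrate what adding bigrams means, the sketch below builds a feature list containing single words plus pairs of adjacent words. It uses plain whitespace tokenization, which is an assumption; the experiments may have used a different tokenizer.

```python
def unigrams_and_bigrams(text):
    """Return lowercase word features followed by adjacent word-pair features."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams
```

Bigrams such as "not good" keep local context that single words lose.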


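Measuring Model Performance for task (2)

Let $\Phi$ be the true classification function and $\hat{\Phi}$ the function produced by the learning algorithm.

$MAE(\hat{\Phi}, TestSet) = \frac{1}{|TestSet|} \sum_{x_i \in TestSet} |\hat{\Phi}(x_i) - \Phi(x_i)|$

$MSE(\hat{\Phi}, TestSet) = \frac{1}{|TestSet|} \sum_{x_i \in TestSet} (\hat{\Phi}(x_i) - \Phi(x_i))^2$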
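Both error measures are simple averages over the test set. A sketch with hypothetical star ratings:

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mean_squared_error(predicted, actual):
    """Average squared difference between predicted and true scores."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predicted and true star ratings.
predicted_stars = [5, 3, 1]
true_stars = [4, 3, 3]
```

MSE penalizes large mistakes (predicting 1 star for a 5-star review) more heavily than MAE does.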

ML in Counting Stars (set 1)

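| Model | Bigrams | F1 - 1 | F1 - 2 | F1 - 3 | F1 - 4 | F1 - 5 | Acc | MAE | MSE |
|---|---|---|---|---|---|---|---|---|---|
| Support Vector Classifier | without | 0.71 | 0.19 | 0.15 | 0.42 | 0.76 | 0.61 | 0.595 | 1.21 |
| Support Vector Classifier | with | 0.72 | 0.17 | 0.10 | 0.43 | 0.76 | 0.62 | 0.589 | 0.59 |
| Linear Regression | without | 0.43 | 0.25 | 0.26 | 0.40 | 0.61 | 0.45 | 0.704 | 1.07 |
| Linear Regression | with | 0.54 | 0.24 | 0.25 | 0.39 | 0.65 | 0.48 | 0.669 | 1.04 |
| SVC Regression | without | 0.37 | 0.24 | 0.27 | 0.42 | 0.62 | 0.45 | 0.68 | 0.98 |
| SVC Regression | with | 0.42 | 0.24 | 0.26 | 0.41 | 0.64 | 0.46 | 0.67 | 0.98 |
| Decision Tree with BernoulliNB | without | 0.70 | 0.25 | 0.20 | 0.47 | 0.75 | 0.60 | 0.56 | 1.05 |
| Decision Tree with BernoulliNB | with | 0.72 | 0.28 | 0.11 | 0.47 | 0.76 | 0.61 | 0.55 | 1.01 |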

ML in Counting Stars (set 2)

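| Model | Bigrams | F1 - 1 | F1 - 2 | F1 - 3 | F1 - 4 | F1 - 5 | Acc | MAE | MSE |
|---|---|---|---|---|---|---|---|---|---|
| Support Vector Classifier | without | 0.54 | 0.37 | 0.28 | 0.58 | 0.75 | 0.62 | 0.46 | 0.66 |
| Support Vector Classifier | with | 0.48 | 0.38 | 0.24 | 0.56 | 0.74 | 0.61 | 0.47 | 0.67 |
| Linear Regression | without | 0.21 | 0.20 | 0.27 | 0.52 | 0.66 | 0.53 | 0.61 | 0.96 |
| Linear Regression | with | 0.32 | 0.29 | 0.29 | 0.53 | 0.70 | 0.55 | 0.56 | 0.86 |
| SVC Regression | without | 0.16 | 0.31 | 0.37 | 0.55 | 0.67 | 0.55 | 0.49 | 0.60 |
| SVC Regression | with | 0.09 | 0.26 | 0.34 | 0.56 | 0.68 | 0.56 | 0.50 | 0.63 |
| Decision Tree with BernoulliNB | without | 0.58 | 0.46 | 0.36 | 0.59 | 0.74 | 0.62 | 0.41 | 0.51 |
| Decision Tree with BernoulliNB | with | 0.52 | 0.47 | 0.35 | 0.60 | 0.76 | 0.63 | 0.40 | 0.49 |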

(3) Quantification Task

We want to understand the users' sentiment on each day, using the percentage of daily positive reviews as a proxy.

The data contain Positive and Negative reviews collected over 5 days for the Kindle Fire and a Harry Potter book. You can download them here: https://www.dropbox.com/s/x512wqnzp1v2xa9/quantificationdata.zip?dl=0

[Figure: Positive Review Percentage by day, two panels (y-axis 20% to 90%, x-axis 0 to 6)]


task (3)

• Classify and Count (CC): $CC = \frac{\text{Predicted Positive}}{\text{Total Reviews}}$

• Probabilistic Classify and Count (PCC): $PCC = \frac{\sum_{i \in \text{Reviews}} P(\text{review } i \text{ is Positive})}{\text{Total Reviews}}$
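The two estimators differ only in what they average: CC counts hard decisions, PCC averages the classifier's posterior probabilities. A sketch, assuming the labels are the strings "pos" and "neg" and the probabilities come from any probabilistic classifier:

```python
def classify_and_count(predicted_labels):
    """CC: fraction of reviews that the classifier labels as positive."""
    return sum(1 for y in predicted_labels if y == "pos") / len(predicted_labels)

def probabilistic_classify_and_count(positive_probabilities):
    """PCC: average probability of being positive over all reviews."""
    return sum(positive_probabilities) / len(positive_probabilities)
```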


task (3)

• Adjusted CC (ACC): $ACC = \frac{CC - FPR}{TPR - FPR}$, where $FPR = \frac{FP}{N}$ and $TPR = \frac{TP}{P}$

• Probabilistic ACC (PACC): $PACC = \frac{PCC - PFPR}{PTPR - PFPR}$, where $PFPR = \frac{PFP}{PFP + PTN}$, $PTPR = \frac{PTP}{PTP + PFN}$ and

$PFP = \sum_{i \in \text{NegativeReviews}} P(\text{review } i \text{ is Positive})$

$PTN = \sum_{i \in \text{NegativeReviews}} P(\text{review } i \text{ is Negative})$

$PTP = \sum_{i \in \text{PositiveReviews}} P(\text{review } i \text{ is Positive})$

$PFN = \sum_{i \in \text{PositiveReviews}} P(\text{review } i \text{ is Negative})$
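The adjustment corrects a raw prevalence estimate using the classifier's error rates. In the sketch below the same hypothetical function serves ACC (with CC, TPR, FPR) and PACC (with PCC, PTPR, PFPR); it assumes the two rates differ and applies no clipping to the [0, 1] range.

```python
def adjusted_count(raw_estimate, tpr, fpr):
    """Adjust a CC (or PCC) estimate using true and false positive rates."""
    return (raw_estimate - fpr) / (tpr - fpr)

# Hypothetical check: with true prevalence 0.6, TPR 0.9 and FPR 0.2, CC is
# expected to report 0.6 * 0.9 + 0.4 * 0.2 = 0.62; the adjustment undoes this.
observed_cc = 0.6 * 0.9 + 0.4 * 0.2
corrected = adjusted_count(observed_cc, tpr=0.9, fpr=0.2)
```

When TPR and FPR are close, the denominator becomes small and the estimate can leave the [0, 1] range.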

ML in Quantification

Kindle Reviews. Training set feature count: 11801. Training set prevalence: 0.781

| Model | MSE(CC) | MSE(ACC) | MSE(PCC) | MSE(PACC) |
|---|---|---|---|---|
| BernoulliNB | 14% | 1% | 14% | 35% |
| SGD | 9% | 1% | 4% | 8% |
| SVC | 12% | 1% | 2% | 0% |
| DecisionTree | 2% | 3% | 2% | 4% |
| RandomForest | 20% | 1% | 6% | 24% |
| AdaBoost | 5% | 1% | 30% | 288% |

Harry Potter Reviews. Training set feature count: 11165. Training set prevalence: 0.795

| Model | MSE(CC) | MSE(ACC) | MSE(PCC) | MSE(PACC) |
|---|---|---|---|---|
| BernoulliNB | 3.56% | 101.65% | 3.71% | 11.32% |
| SGD | 8.28% | 4.90% | 3.89% | 3.23% |
| SVC | 15.92% | 23.32% | 9.36% | 51.81% |
| DecisionTree | 5.26% | 12.01% | 5.26% | 3.75% |
| RandomForest | 34.41% | 4.67% | 8.79% | 17.34% |
| AdaBoost | 3.55% | 16.15% | 34.44% | 284.02% |


Predicting Sentiment

[Figure: Kindle Fire, true and predicted positive review percentage over 5 days using Decision Tree. Series: % True Positive Reviews, CC, ACC, PCC, PACC]

[Figure: HP Reviews, true and predicted positive review percentage over 5 days using Decision Tree. Series: % True Positive Reviews, CC, ACC, PCC, PACC]

(4) Fake Review Detection Task

We want to classify a review as Real or Fake

Data consist of truthful and deceptive reviews from TripAdvisor, Mechanical Turk, Expedia, Hotels.com, Orbitz, Priceline and Yelp for the 20 most popular Chicago hotels. They are available here:

http://myleott.com/op_spam/

10-fold cross-validation (k = 10) is applied.

Positive reviews

| Model | Kfold Accuracy | Kfold Precision (TP/P*) | Kfold Recall (TP/P) | Kfold F1 |
|---|---|---|---|---|
| LinearSVC | 0.899 (719/800) | 0.906 (356/393) | 0.890 (356/400) | 0.898 (712/793) |
| BernoulliNB | 0.866 (693/800) | 0.945 (311/329) | 0.777 (311/400) | 0.853 (622/729) |
| SGDClassifier | 0.864 (691/800) | 0.870 (342/393) | 0.855 (342/400) | 0.863 (684/793) |
| RandomForest | 0.864 (691/800) | 0.888 (333/375) | 0.833 (333/400) | 0.859 (666/775) |
| AdaBoost | 0.825 (660/800) | 0.837 (323/386) | 0.807 (323/400) | 0.822 (646/786) |

Negative reviews

| Model | Kfold Accuracy | Kfold Precision (TP/P*) | Kfold Recall (TP/P) | Kfold F1 |
|---|---|---|---|---|
| LinearSVC | 0.896 (717/800) | 0.903 (355/393) | 0.887 (355/400) | 0.895 (710/793) |
| BernoulliNB | 0.861 (689/800) | 0.926 (314/339) | 0.785 (314/400) | 0.850 (628/739) |
| SGDClassifier | 0.880 (704/800) | 0.906 (339/374) | 0.848 (339/400) | 0.876 (678/774) |
| RandomForest | 0.850 (680/800) | 0.893 (318/356) | 0.795 (318/400) | 0.841 (636/756) |
| AdaBoost | 0.772 (618/800) | 0.771 (310/402) | 0.775 (310/400) | 0.773 (620/802) |

Thanks!

Andrea Gigli

https://about.me/andrea.gigli
