data mining on yelp dataset

Data Crackers on YELP Dataset
- Prashanth Sandela (PS)
- Vimal Chandra Gorijala (VG)
- Parineetha Tirumali Gandhi (PG)

Agenda

Project Vision

Data Mining Tasks

Hypothesis

Data

Solutions

Experiment results

Models Comparison

Conclusion

Timelines

Project Vision

The Yelp dataset covers a variety of businesses, which users review and rate with stars.

Our task is to classify user star ratings based on review text, businesses, and users.

We used various data mining models to classify user star ratings, applying various model tuning techniques to reach the best classification accuracy.

Data Mining Task

This kind of classification comes under Multi-Class Classification.

Data Processing:
• Converting to CSV
• Stop word removal
• Special character removal
• Lower case conversion
• Consolidation of dataset
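The text-cleaning steps listed above (lower case conversion, special character removal, stop word removal) can be sketched in a few lines of Python. This is only an illustration: the function name `clean_review` and the tiny stop-word list are assumptions, since the slides do not say which stop-word list or tool settings were used.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the slides).
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to", "was"}

def clean_review(text):
    """Lower-case a review, strip special characters, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on whitespace and keep only tokens that are not stop words.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(clean_review("The pizza was GREAT, and the service is fast!!"))
```

The function returns a token list rather than a string so that later steps (n-grams, word counts) can work on it directly.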

Formulated Models:
• Naïve Bayes implementation in HIVE
• Naïve Bayes Multinomial Classification using WEKA
• Naïve Bayes Multinomial Text (Best Case)
• Decision Tree
• KNN

Evaluation Metrics:
• Ngrams
• Features
• Percentage division of test and training data
• Accuracy
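Two of the items above, the percentage division of training and test data and accuracy, are simple enough to sketch directly. The helper names `split_train_test` and `accuracy`, the seed, and the 80/20 fraction are illustrative assumptions; the slides do not state the split that was actually used.

```python
import random

def split_train_test(records, train_fraction, seed=0):
    """Shuffle the records and split them into (train, test) by fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

train, test = split_train_test(range(100), train_fraction=0.8)
print(len(train), len(test))
print(accuracy([1, 0, -1, 1], [1, 0, 1, 1]))  # 3 of 4 correct
```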

Hypothesis

Because the classification is based on text, Naïve Bayes could be a good model to consider.

Dividing text into n-grams will increase accuracy. We expected unigrams to give the best accuracy; this assumption proved wrong, as bigrams gave better accuracy.

Using business id and user id together will not give better accuracy than using each as an individual feature.
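To make the n-gram hypothesis concrete, here is a minimal sketch of n-gram extraction from a token list. The function name `ngrams` is an assumption. One plausible reason bigrams can help (an interpretation, not something the slides state) is that a bigram such as "not good" keeps a negation together with the word it modifies.

```python
def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    # A list shorter than n yields an empty result because the range is empty.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["pizza", "is", "not", "good"]
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```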

Data

[Chart: Count of reviews by Review Stars class (NEGATIVE, NEUTRAL, POSITIVE). Counts shown: 213509, 163761, 748188. Shares shown: 66.5%, 13.5%, 20%.]

It contains 1.3 million reviews, 40 thousand businesses, and 250 thousand users.

Classify review stars based on review text. Stars range from 1 to 5. To simplify the problem, we map stars 4 and 5 to 1, stars 3 to 0, and stars 1 and 2 to -1.

Features:
• Business Id
• Review Id
• User Id
• Review text
• Stars (Ratings)

Classification:
• Negative
• Neutral
• Positive
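The reduction from five star values to three classes (stars 4 and 5 to 1, stars 3 to 0, stars 1 and 2 to -1) can be written as a small function. The names `star_to_class` and `CLASS_NAMES` are illustrative assumptions.

```python
# Class codes used on the slides, paired with the class names.
CLASS_NAMES = {1: "POSITIVE", 0: "NEUTRAL", -1: "NEGATIVE"}

def star_to_class(stars):
    """Map a 1 to 5 star rating to 1 (positive), 0 (neutral) or -1 (negative)."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if stars >= 4:
        return 1
    if stars == 3:
        return 0
    return -1

for stars in range(1, 6):
    print(stars, CLASS_NAMES[star_to_class(stars)])
```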

Tools and Technologies Used
• Pentaho Data Integration
• HIVE (in AWS using S3 and EC2)
• WEKA
• Experimented with H2O

Solution

Naïve Bayes (Implementation in HIVE)
Naïve Bayes can be implemented in HIVE. Doing so is scalable, and there is no limitation on the size of the data. The main effort is expressing every step as a query.

Naïve Bayes Multinomial (WEKA)
A classifier that uses a multinomial distribution for each of the features. This model is mainly useful for multi-class classification. Initially we had 5 classes for the reviews, but we reduced them to three (positive, neutral, negative).
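To show what a multinomial Naïve Bayes classifier computes, here is a minimal from-scratch sketch with add-one (Laplace) smoothing. It is not WEKA's implementation; the class name `MultinomialNB`, the choice of smoothing, the handling of unseen words, and the toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)        # label -> number of documents
        self.word_counts = defaultdict(Counter)    # label -> word -> count
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_score(self, tokens, label):
        """log P(label) + sum of log P(token | label) over known tokens."""
        score = math.log(self.class_counts[label] / self.total_docs)
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for tok in tokens:
            if tok not in self.vocabulary:
                continue  # words never seen in training are ignored
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def predict(self, tokens):
        """Return the label with the highest log score."""
        return max(self.class_counts, key=lambda label: self.log_score(tokens, label))

# Toy, made-up reviews labelled 1 (positive), 0 (neutral), -1 (negative).
docs = [["pizza", "good"], ["laptop", "good"], ["service", "bad"],
        ["pizza", "bad"], ["food", "okay"]]
labels = [1, 1, -1, -1, 0]

model = MultinomialNB().fit(docs, labels)
print(model.predict(["pizza", "good"]))
print(model.predict(["service", "bad"]))
```

Working in log space avoids numeric underflow when a review contains many words, which matters for a dataset of this size.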

Naïve Bayes Multinomial Text (WEKA)
A special type of Bayesian classifier that operates only on string attributes. It suits text data best. Other types of attributes are accepted but are ignored during training and classification. It uses word frequencies rather than a bag-of-words representation.

Decision Tree (WEKA)

• Instances are described with a fixed set of attributes and their values.

• Suited for almost all kinds of input, such as text, numeric, and nominal data.

• Easily extends to learning functions with more than two possible outcomes.

• Learning methods are robust to errors.

KNN (WEKA)
An unknown instance is classified by relating it to the known instances according to some distance/similarity function.
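The idea of relating an unknown instance to known ones through a distance function can be sketched as follows, with Manhattan and Euclidean distance as two interchangeable choices. This is not WEKA's implementation; the function names, the two-dimensional toy vectors (counts of the words "good" and "bad"), and the value of `k` are assumptions.

```python
import math
from collections import Counter

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=5, distance=euclidean):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy vectors: (count of "good", count of "bad") in each made-up review.
vectors = [(2, 0), (1, 0), (0, 2), (0, 1), (1, 1)]
labels = [1, 1, -1, -1, 0]

print(knn_predict(vectors, labels, (3, 0), k=3, distance=euclidean))
print(knn_predict(vectors, labels, (3, 0), k=3, distance=manhattan))
```

Passing the distance as a parameter makes it easy to compare the two metrics on the same data.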

Experiment Results (Naïve Bayes Multinomial & Multinomial Text)

Action                            Naïve Bayes Multinomial   Naïve Bayes Multinomial Text
Initial %                         48                        54
Stopwords                         53                        56
Stemmer                           54                        59
Unigrams                          59                        61
Min Word Frequency from 5 to 10   65                        66
Bigrams                           70                        71
Trigrams                          63                        62
Business and user id              63.5                      63
Business id                       74                        75
User id                           74                        74.7
Attribute Selection Filter        78                        78.4
Bag of Words Count                79
Overall Accuracy                  79.49                     79.6

Experiment Results (KNN & Decision Tree)

Action                 5NN       Decision Tree
Initial %              68.2535   72.4138
Stopwords              68.2535   72.6521
Stemmer                68.2535   72.9523
Unigrams               64.5532   73.1538
Bigrams                65.5532   73.3251
Trigrams               65.5532   72.5216
Business and user id   66.3256   69.5364
Business id            69.6529   73.3526
User id                70.1253   74.2596
Bag of words           71.1253   74.2596
Overall accuracy       71.1253   74.2596

Action                 Manhattan   Euclidean
Bi/Trigrams            65.5131     64.5532
Stemmer (Lovins)       68.2535     68
Words to Keep (5000)   71.1253     70

Experiment Results (Naïve Bayes in HIVE)

Sl. No   Action                                *Probability Model
1        Initial Dataset                       44%
2        Refining of Training and Test Data    7%
3        Change of Stemmer                     0.50%
4        Ngrams:
           Unigrams                            3%
           Bigrams                             7%
           Trigrams                            2%
5        Including Features:
           Business id and User id             -1%
           Business id                         2%
           User id                             4%
6        Bag of words                          5%
7        Overall Accuracy on 100,000 records   ~72%
8        Accuracy on complete dataset          ~74%

Experiment Results (Naïve Bayes Multinomial)

Sl. No   Action                                *Naïve Bayes Multinomial
1        Initial Dataset                       46%
2        Refining of Training and Test Data    N/A
3        Change of Stemmer                     N/A
4        Ngrams:
           Unigrams                            3.50%
           Bigrams                             7.50%
           Trigrams                            2%
5        Including Features:
           Business id and User id             1%
           Business id                         3%
           User id                             5%
6        Bag of words                          4%
7        Overall Accuracy on 100,000 records   ~76%
8        Accuracy on complete dataset          N/A

Comparison

Use Case (Why not business id and user id together)

Training Data

Business id   User id   Review text       Stars
1             1         Laptop is good    1
1             2         HP is bad         0
1             3         Lenovo is good    1
2             1         Pizza is good     1
2             4         Pizza is bad      0
3             1         Product is bad    0

Test Data

Business id   User id   Review text
1             1         Product is good
2             4         Pizza is good

Business id

Bus id   Words                Probability   Stars
1        Laptop, HP, Lenovo   0.15          1
1        Good                 0.35          1
1        Bad                  0.15          0
2        Pizza                0.5           1
2        Good                 0.25          1
2        Bad                  0.25          0

User id

User id   Words           Probability   Stars
1         Laptop, Pizza   0.16          1
1         Product         0.16          0
1         Good            0.32          1
1         Bad             0.16          0
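One way to read this use case (an interpretation, not something the slides spell out) is that conditioning on business id and user id together leaves very few training rows to learn from, while each id on its own keeps more evidence. The sketch below counts the supporting training rows for each test row, using the toy training data from this slide; the helper name `support` is an assumption.

```python
# Toy training rows from the slide: (business id, user id, review text, stars).
training = [
    (1, 1, "Laptop is good", 1),
    (1, 2, "HP is bad", 0),
    (1, 3, "Lenovo is good", 1),
    (2, 1, "Pizza is good", 1),
    (2, 4, "Pizza is bad", 0),
    (3, 1, "Product is bad", 0),
]

def support(business_id=None, user_id=None):
    """Training rows that match the given ids; None means 'any'."""
    return [row for row in training
            if (business_id is None or row[0] == business_id)
            and (user_id is None or row[1] == user_id)]

# Test rows from the slide: (business id, user id).
for business_id, user_id in [(1, 1), (2, 4)]:
    print("business", business_id, "user", user_id,
          "| rows by business:", len(support(business_id=business_id)),
          "| rows by user:", len(support(user_id=user_id)),
          "| rows by both:", len(support(business_id, user_id)))
```

For the test row with business id 2 and user id 4, the only training row matching both ids is "Pizza is bad", so a model restricted to that pair would see no positive evidence at all.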

Project Management

Milestones:
• Project Proposal: 9/11/2014
• Report 1 - Initial Attempt: 9/30/2014
• Report 2 - Data Processing & Feature Selection: 10/28/2014
• Report 3 - Model Selection & Tuning: 11/27/2014

[Gantt chart: 2014, week axis 1 to 13, one row each for PG, VG and PS. Phases: Project Proposal, Initial Attempt, Data Processing & Feature Selection, Model Selection & Tuning. Task labels: Dataset Decision, Data Mining Problems, Key Attributes, Machine Learning Models, Data Sampling, Data Cleaning, Model Selection and Implementation, Future Task, Data Quality Problems, Data Processing, Feature Selection, Feature Extraction, Model Selection, Model Tuning & Results Comparison.]

Thank You!!

Questions…?
