Lexicon-Based Sentiment Analysis at GHC 2014

Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree. Bo-Hyun Kim, Sr. Software Engineer, HP Big Data Business Unit. Oct 10th, 2014. #GHC14

Upload: bo-hyun-kim

Post on 26-Jun-2015


Category:

Data & Analytics



DESCRIPTION

Presented at the Grace Hopper Celebration in the Data Science track. The talk covers sentiment analysis with HP Vertica Pulse and how to improve its accuracy through appropriate pre-processing and training with a Naïve Bayes classifier.

TRANSCRIPT

Page 1: Lexicon-Based Sentiment Analysis at GHC 2014

2014

Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree

Bo-Hyun Kim, Sr. Software Engineer

HP Big Data Business Unit

Oct 10th, 2014

#GHC14


Page 2: Lexicon-Based Sentiment Analysis at GHC 2014

What to Expect

Sentiment Analysis
− What is it?
− Why is it interesting?
− How HP Vertica Pulse works
− Achieving greater accuracy
− Different point of view using the most-mentioned word tree

Page 3: Lexicon-Based Sentiment Analysis at GHC 2014


What I Expect

A 5-star rating on GHC app

I just expect you to enjoy and learn!

Page 4: Lexicon-Based Sentiment Analysis at GHC 2014

Sentiment Analysis

In plain English
− Sentiment analysis, the process of automatically detecting if a text segment contains emotional or opinionated content and determining its polarity (e.g., “thumbs up” or “thumbs down”), is a field of research that has received significant attention in recent years, both in academia and in industry. [Wright, 2009]

Page 5: Lexicon-Based Sentiment Analysis at GHC 2014

Gimme Examples!

Also known as:
− Opinion Mining
− Text Mining

Determine people’s general opinion
− “I just got a new car, and I’m loving it”
− “My new car isn’t as fast as I thought.”

Page 6: Lexicon-Based Sentiment Analysis at GHC 2014

Why are we interested?

Increasing (every minute!) web usage
− Articles
− Blogs
− Comments

Power of Social Media
− Online Shopping
− Customer Reviews
− Recommended products on Amazon
− How other people feel about the product

Page 7: Lexicon-Based Sentiment Analysis at GHC 2014


Product Review

Page 8: Lexicon-Based Sentiment Analysis at GHC 2014


Data… Data… Data…

Page 9: Lexicon-Based Sentiment Analysis at GHC 2014


HP Vertica Pulse

Page 10: Lexicon-Based Sentiment Analysis at GHC 2014

How to Analyze?

Lexicon-based approach – HP Labs [Zhang et al., 2011]
Choose a product, person, event, organization, or topic [Hu and Liu, 2004] to analyze the opinion
Determine the Semantic Orientation score of opinion lexicons

Word        Semantic Orientation Value
Fabulous    +3
Good        +1
Bad         -1
Nasty       -3
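
The lookup step can be sketched in a few lines of Python. This is only an illustration, not how HP Vertica Pulse is implemented: the four lexicon entries come from the table above, while the tokenizer and the function names `opinion_words` and `raw_score` are assumptions made for the example.

```python
import re

# Semantic Orientation (SO) values copied from the slide's table.
SO_LEXICON = {"fabulous": 3, "good": 1, "bad": -1, "nasty": -3}


def opinion_words(text):
    """Return (token_index, word, so_value) for every lexicon word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(i, t, SO_LEXICON[t]) for i, t in enumerate(tokens) if t in SO_LEXICON]


def raw_score(text):
    """Sum the SO values of all opinion words found in the text."""
    return sum(so for _, _, so in opinion_words(text))


print(raw_score("The food was fabulous but the service was bad"))
```

A real lexicon would hold thousands of entries; the point here is only that each opinion word contributes a signed value.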

Page 11: Lexicon-Based Sentiment Analysis at GHC 2014

Sentiment Scoring

Input: text or sentence
Output: for each attribute or entity, generates a sentiment score ranging from -1 to 1
− -1: Negative sentiment
− 0: Neutral sentiment
− 1: Positive sentiment

Entity-level lexicon-based sentiment scoring

Page 12: Lexicon-Based Sentiment Analysis at GHC 2014

Limitation

Semantic Orientation value(‘missed’) = -1
Gives more weight to the closely located word
Accuracy can suffer
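
To see why proximity weighting can hurt, here is a minimal sketch that assumes each opinion word is weighted by the inverse of its token distance to the entity and that the result is normalized into the -1 to 1 range. The slides do not give Pulse's actual formula, so the weighting, the normalization, the example sentence, and every lexicon entry except ‘missed’ = -1 are assumptions.

```python
import re

# Only 'missed' = -1 comes from the slide; 'great' is a hypothetical entry.
SO_LEXICON = {"missed": -1, "great": 1}


def entity_score(text, entity):
    """Distance-weighted score for one entity, normalized into [-1, 1]."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if entity not in tokens:
        return 0.0
    pos = tokens.index(entity)
    weighted = 0.0
    total_weight = 0.0
    for i, tok in enumerate(tokens):
        if tok in SO_LEXICON and i != pos:
            w = 1.0 / abs(i - pos)  # closer opinion words count more
            weighted += SO_LEXICON[tok] * w
            total_weight += abs(SO_LEXICON[tok]) * w
    return weighted / total_weight if total_weight else 0.0


# 'missed' sits 2 tokens from 'concert', 'great' sits 5 tokens away,
# so the negative word dominates even though the concert was praised.
print(entity_score("I missed the concert and heard it was great", "concert"))
```

In this sentence the writer is positive about the concert, yet the score comes out negative because ‘missed’ is closer to the entity than ‘great’.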

Page 13: Lexicon-Based Sentiment Analysis at GHC 2014

Improve accuracy

Accuracy is what we strive for!
More robust pre-processing
− Prune data to fit different types of user opinion (e.g. Twitter vs. YouTube comments)
Naïve Bayes Classifier Training
Tune accordingly

Page 14: Lexicon-Based Sentiment Analysis at GHC 2014

Data Set

Test dataset
− Collected by Stanford students
− In 2009
− Over 3 million tweets with tested score
− Analyzed 3500 tweets

Collected dataset
− HP Vertica Pulse Twitter Connector
− In 2014
− Total of 1.2 million tweets over 30 days

Page 15: Lexicon-Based Sentiment Analysis at GHC 2014

Data Pruning

Remove
− Job postings
  • #job, #jobs, #tweetmyjob
− Links
  • http://this.is/nogood
− Duplicates
− Twitter-specific characters
  • RT, @, #
− Emoticons
  • I hate my life :-), sarcasm is a widespread disease

After pruning
− ~287000 tweets, 24% of the 1.2 million tweets
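
The pruning rules above could look roughly like this in Python. It is a sketch under assumptions: the regular expressions, the order of the steps, and the case-insensitive duplicate check are illustrative choices, not the pipeline used in the talk.

```python
import re

JOB_TAGS = re.compile(r"#(?:jobs?|tweetmyjob)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
EMOTICON = re.compile(r"[:;=]-?[)(DPp]")  # covers only a few simple emoticons
RETWEET = re.compile(r"\bRT\b")
TWITTER_CHARS = re.compile(r"[@#]")


def prune(tweets):
    """Drop job postings and duplicates; strip links, emoticons, RT, @ and #."""
    seen = set()
    kept = []
    for tweet in tweets:
        if JOB_TAGS.search(tweet):
            continue  # job posting: discard the whole tweet
        text = URL.sub("", tweet)
        text = EMOTICON.sub("", text)
        text = RETWEET.sub("", text)
        text = TWITTER_CHARS.sub("", text)
        text = " ".join(text.split())  # collapse leftover whitespace
        key = text.lower()
        if not text or key in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(key)
        kept.append(text)
    return kept


sample = [
    "RT @bob: I hate my life :-)",
    "New opening! #jobs http://this.is/nogood",
    "Loving #healthcare http://this.is/nogood",
    "loving healthcare",
]
print(prune(sample))
```

As a check on the slide's figure, 287000 / 1200000 is about 23.9%, which rounds to the quoted 24%.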

Page 16: Lexicon-Based Sentiment Analysis at GHC 2014

Naïve Bayes Classifier

Supervised learning
− Probabilistic classifier based on Bayes’ theorem
− Requires only a small amount of training data
− Assumes the presence/absence of a particular feature of a class is unrelated to the presence/absence of any other feature
− Classifies the object based on its included features
− Open source found at [nltk.org]
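
To make the independence assumption concrete, here is a small self-contained Naïve Bayes sketch. The talk used the open-source implementation from nltk.org; the code below is not that implementation but a word-count (multinomial) variant with add-one smoothing, written against the standard library only. The training sentences and labels are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """labeled_texts: iterable of (text, label). Returns the fitted counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in label_counts.items():
        logp = math.log(n_docs / total_docs)  # prior P(label)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Each word contributes independently (the "naive" assumption),
            # with add-one smoothing to avoid zero probabilities.
            logp += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical toy training data.
model = train([
    ("i love this car", "pos"),
    ("what a great day", "pos"),
    ("i hate this traffic", "neg"),
    ("what an awful day", "neg"),
])
print(classify(model, "love this great car"))
```

Because every word is scored on its own, the classifier needs only per-label word counts, which is why a modest amount of labeled data is enough to get started.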

Page 17: Lexicon-Based Sentiment Analysis at GHC 2014

Naïve Bayes Classifier

Results:
− Final accuracy: 0.788

Page 18: Lexicon-Based Sentiment Analysis at GHC 2014

Tuning Pulse

Positive words
Negative words
Neutral words
White lists
Stop words
Synonym mappings

Page 19: Lexicon-Based Sentiment Analysis at GHC 2014

Accuracy Comparison

Sentiment scores generated for each phase

Keyword      Ideal     Original  Pruning   Training  Tuning
Healthcare   -0.1515   -0.0333   -0.0833   -0.1      -0.125
Obama         0.308     0.0944    0.1535    0.1535    0.1842
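
One way to read the table is to compute how far each phase's score is from the Ideal column. The snippet below does that arithmetic on the numbers shown above; the names `SCORES`, `PHASES`, and `gaps` are illustrative.

```python
PHASES = ["Original", "Pruning", "Training", "Tuning"]

# Values copied from the Accuracy Comparison table.
SCORES = {
    "Healthcare": {"Ideal": -0.1515, "Original": -0.0333, "Pruning": -0.0833,
                   "Training": -0.1, "Tuning": -0.125},
    "Obama": {"Ideal": 0.308, "Original": 0.0944, "Pruning": 0.1535,
              "Training": 0.1535, "Tuning": 0.1842},
}


def gaps(keyword):
    """Absolute distance from the Ideal score for each phase."""
    row = SCORES[keyword]
    return {p: round(abs(row[p] - row["Ideal"]), 4) for p in PHASES}


for keyword in SCORES:
    print(keyword, gaps(keyword))
```

For Healthcare the gap shrinks from 0.1182 (Original) to 0.0265 (Tuning); for Obama it shrinks from 0.2136 to 0.1238, with no change between the Pruning and Training phases.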

Page 20: Lexicon-Based Sentiment Analysis at GHC 2014

Trend/Targeted Analysis

Targeted dataset analysis can help improve accuracy
Identify the most-mentioned words
− Use the most-recurrent words to narrow the scope of analysis

Find new trends
− Government healthcare (2009) vs. Obamacare (2014)

Are we looking at the targeted data?
− “Solve healthcare challenges with technology!”
− “Healthcare After ObamaCare”
− “Get affordable healthcare at HealthCare.gov”

Page 21: Lexicon-Based Sentiment Analysis at GHC 2014

Generating Tree

Increase the relevancy of the sentiment score by running the sentiment analysis on the entity, as well as on the most-recurrent words, to identify:
− Homonyms that machines do not understand
− More accurate scores based on user interest

Generate tree using Text Search
− Merge stemmed words
  e.g. query, queries, querying…
− Lucene - Apache open source
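
The idea of finding the most-mentioned words around an entity, with word variants merged, can be sketched as follows. The talk used Lucene for text search and stemming; this Python stand-in uses a toy suffix stripper instead, and the stop-word list and the last sample tweet are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list for the example.
STOP_WORDS = {"the", "a", "an", "at", "with", "get", "after", "and", "is", "to", "of"}


def stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer such as Lucene's."""
    for suffix in ("ying", "ies", "ing", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def most_mentioned(texts, entity, top_n=3):
    """Count stems that co-occur with the entity, once per text."""
    target = stem(entity.lower())
    counts = Counter()
    for text in texts:
        stems = {stem(t) for t in re.findall(r"[a-z]+", text.lower())}
        if target in stems:
            counts.update(s for s in stems if s != target and s not in STOP_WORDS)
    return counts.most_common(top_n)


tweets = [
    "Solve healthcare challenges with technology!",
    "Healthcare After ObamaCare",
    "Get affordable healthcare at HealthCare.gov",
    "Obamacare healthcare query",  # hypothetical extra tweet
]
print(stem("query"), stem("queries"), stem("querying"))
print(most_mentioned(tweets, "healthcare", top_n=1))
```

Repeating the same search on each top word (for example on "obamacare") would give the next level of the tree.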

Page 22: Lexicon-Based Sentiment Analysis at GHC 2014

Tree View

[Tree diagram. Root: healthcare. Other nodes: obamacare, !(Obamacare), obama, !(Obama), health, !(health)]

Page 23: Lexicon-Based Sentiment Analysis at GHC 2014


Thank you

[email protected]

[email protected]

Many thanks to*:

Tim Donar, Solution Engineer

Beth Favini, Tech Pubs Sr. Manager

Judith Plummer, Tech Pubs Editor in Chief

* In alphabetical order

Page 24: Lexicon-Based Sentiment Analysis at GHC 2014


Got Feedback?

Rate and Review the session using the GHC Mobile App

To download visit www.gracehopper.org