yunchao he icot2015

SENTIMENT CLASSIFICATION OF SHORT TEXTS BASED ON SEMANTIC CLUSTERING (PAPER ID: 61)

Yunchao He, Chin-Sheng Yang, Liang-Chih Yu, K. Robert Lai and Weiyi Liu

SENTIMENT ANALYSIS

- “unbelievably disappointing”
- “Full of zany characters and richly applied satire, and some great plot twists”
- “this is the greatest screwball comedy ever filmed”
- “It was pathetic. The worst part about it was the boxing scenes.”

Sentiment analysis:
- Using NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit
- Sometimes called opinion mining, although the emphasis in this case is on extraction
- Other names: opinion extraction, sentiment mining, subjectivity analysis

APPLICATION: PRODUCT REVIEWS


WHY SENTIMENT ANALYSIS?

- Movie: is this review positive or negative?
- Products: what do people think about the new iPhone?
- Public sentiment: how is consumer confidence? Is despair increasing?
- Politics: what do people think about this candidate or issue?
- Prediction: predict election outcomes or market trends from sentiment


CHALLENGES IN SENTIMENT ANALYSIS

- People express opinions in complex ways.
- In opinion texts, lexical content alone can be misleading.
- Intra-textual and sub-sentential reversals, negation, and topic changes are common.
- Rhetorical devices and modes such as sarcasm, irony, and implication add further difficulty.


POLARITY CLASSIFICATION (E.G., [GO+ 09, PANG+ 04])

- Tokenization
- Feature extraction: n-grams, semantic features, syntactic features, etc.
- Classification using different classifiers: Naïve Bayes, MaxEnt, SVM

Drawbacks:
- Sparsity: a short text activates only a few features.
  S1: I really like this movie  [... 0 0 1 1 1 1 1 0 0 ...]
- Context independence: a word yields the same feature regardless of its context.
  S1: This phone has a good keypad
  S2: He will move and leave her for good
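The two drawbacks can be illustrated with a minimal sketch of a binary bag-of-unigrams representation. The helper names `tokenize` and `bag_of_unigrams` and the toy vocabulary are assumptions made for illustration; they are not taken from the slides.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace, dropping basic punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def bag_of_unigrams(text, vocabulary):
    """Binary vector: 1 where a vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


s1 = "This phone has a good keypad"
s2 = "He will move and leave her for good"

# Toy vocabulary built from the two sentences only; a real one is far larger,
# which makes each short-text vector even sparser.
vocabulary = sorted(set(tokenize(s1)) | set(tokenize(s2)))
v1 = bag_of_unigrams(s1, vocabulary)
v2 = bag_of_unigrams(s2, vocabulary)

# Context independence: "good" switches on the same feature in both sentences,
# although it carries sentiment only in the first one.
good = vocabulary.index("good")
print(v1[good], v2[good])

# Sparsity: only part of the vocabulary is active for each sentence.
print(sum(v1), "of", len(vocabulary), "features active in S1")
```

Because the vector records only word presence, the classifier cannot tell the evaluative "good" in S1 from the idiomatic "for good" in S2.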

SEMANTIC CLUSTERING: BASIC IDEA

Use a clustering algorithm to aggregate short texts into long clusters, in which each cluster has the same topic and the same sentiment polarity. This reduces the sparsity of the short-text representation while keeping it interpretable.

S1: it works perfectly! Love this product
S2: very pleased! Super easy to, I love it
S3: I recommend it

Vocabulary: it works perfectly love this product very pleased super easy to I recommend

S1:       [1 1 1 1 1 1 0 0 0 0 0 0 0]
S2:       [1 0 0 1 0 0 1 1 1 1 1 1 0]
S3:       [1 0 0 0 0 0 0 0 0 0 0 1 1]
S1+S2+S3: [... 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 ...]
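The merging step in this example can be sketched as an element-wise OR over the binary vectors of the texts in a cluster. The vocabulary order is copied from the slide; the function names and the choice of OR (rather than summed counts) are assumptions made for illustration.

```python
VOCABULARY = ["it", "works", "perfectly", "love", "this", "product",
              "very", "pleased", "super", "easy", "to", "i", "recommend"]


def tokenize(text):
    """Lowercase the text and split on whitespace, dropping basic punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def binary_vector(text, vocabulary=VOCABULARY):
    """1 where a vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


def merge(vectors):
    """Element-wise OR: a feature is active if any member text activates it."""
    return [1 if any(column) else 0 for column in zip(*vectors)]


texts = [
    "it works perfectly! Love this product",
    "very pleased! Super easy to, I love it",
    "I recommend it",
]
vectors = [binary_vector(t) for t in texts]
cluster = merge(vectors)

for name, vec in zip(["S1", "S2", "S3"], vectors):
    print(name, vec, "active:", sum(vec))
print("S1+S2+S3", cluster, "active:", sum(cluster))
```

Each single text activates only part of the vocabulary, while the merged cluster activates all of it, which is the density gain the slide describes.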

CLUSTERING TRAINING DATA

- Training data are labeled with positive and negative polarity.
- The K-means clustering algorithm is used to cluster positive and negative texts separately.
- Algorithms: K-means, KNN, LDA…

Training texts:
- works perfectly! Love this product
- completely useless, return policy
- very pleased! Super easy to, I am pleased
- was very poor, it has failed
- highly recommend it, high recommended!
- it totally unacceptable, is so bad

Topical clusters:

Positive:
- works perfectly! Love this product
- very pleased! Super easy to, I am pleased
- highly recommend it, high recommended!

Negative:
- completely useless, return policy
- was very poor, it has failed
- it totally unacceptable, is so bad
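A minimal sketch of clustering each polarity separately follows. It uses a small pure-Python K-means over term-count vectors with Euclidean distance. The deterministic initialisation (first k vectors), the fixed iteration count, and all function names are assumptions for illustration, not details given in the slides.

```python
import math
from collections import Counter


def tokenize(text):
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def vectorize(texts):
    """Term-count vectors over the vocabulary of the given texts."""
    vocabulary = sorted({w for t in texts for w in tokenize(t)})
    vectors = []
    for t in texts:
        counts = Counter(tokenize(t))
        vectors.append([counts[w] for w in vocabulary])
    return vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=10):
    """Plain K-means; centroids start at the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def topical_clusters(texts, k):
    """Cluster same-polarity texts and join each cluster into one long text."""
    assignment = kmeans(vectorize(texts), k)
    groups = [[] for _ in range(k)]
    for text, a in zip(texts, assignment):
        groups[a].append(text)
    return [" ".join(g) for g in groups if g]


positive = ["works perfectly! Love this product",
            "very pleased! Super easy to, I am pleased",
            "highly recommend it, high recommended!"]
negative = ["completely useless, return policy",
            "was very poor, it has failed",
            "it totally unacceptable, is so bad"]

# Positive and negative texts are clustered separately, so every cluster
# inherits a single polarity from its members.
positive_clusters = topical_clusters(positive, k=1)
negative_clusters = topical_clusters(negative, k=1)
print(positive_clusters)
print(negative_clusters)
```

With `k=1` each polarity collapses into one cluster, mirroring the slide; on a real corpus `k` would be larger so that each cluster covers one topic.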

ADVANTAGES OF CLUSTERING TEXT

- Topical consistency: the texts in each cluster share a similar topic.
- Reduced sparsity: the representation of a topical cluster is denser than that of a single text.
- The idea is easy to apply to other areas.


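TRAINING CLASSIFIERS

- Classifier: Multinomial Naive Bayes
- Probabilistic classifier: gives the probability of a label given a clustered text

By Bayes' theorem and the independence assumption, the label of a clustered text $C_i$ with words $C_{i,1}, \dots, C_{i,N_{C_i}}$ is

$$\hat{s} = \arg\max_{s \in S} P(s \mid C_i) = \arg\max_{s \in S} P(s) \prod_{j=1}^{N_{C_i}} P(C_{i,j} \mid s)$$

with the prior and the smoothed word likelihood estimated as

$$P(s) = \frac{N_s}{N}, \qquad P(C_{i,j} \mid s) = \frac{N(C_{i,j}, s) + 1}{\sum_{x \in V} N(x, s) + |V|}$$

where $N_s$ is the number of training instances with label $s$, $N$ is the total number of training instances, $N(x, s)$ is the count of word $x$ in instances labeled $s$, and $V$ is the vocabulary.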
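As a rough sketch, a Multinomial Naive Bayes classifier with add-one smoothing can be written in a few lines. The training clusters below are assembled from the example texts on the clustering slide. The function names, the log-space computation, and the choice to skip words unseen in training are assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def train(labeled_clusters):
    """labeled_clusters: list of (clustered_text, label) pairs."""
    label_counts = Counter(label for _, label in labeled_clusters)
    word_counts = {label: Counter() for label in label_counts}
    for text, label in labeled_clusters:
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    # Prior: share of training instances carrying each label.
    priors = {s: n / len(labeled_clusters) for s, n in label_counts.items()}
    return priors, word_counts, vocabulary


def log_scores(text, model):
    """log P(s) + sum of log P(word | s) for every label s."""
    priors, word_counts, vocabulary = model
    words = [w for w in tokenize(text) if w in vocabulary]  # skip unseen words
    scores = {}
    for s in priors:
        denom = sum(word_counts[s].values()) + len(vocabulary)
        scores[s] = math.log(priors[s]) + sum(
            math.log((word_counts[s][w] + 1) / denom) for w in words)
    return scores


def posterior(text, model):
    """Scores normalised into P(s | text)."""
    scores = log_scores(text, model)
    top = max(scores.values())
    weights = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def classify(text, model):
    scores = log_scores(text, model)
    return max(scores, key=scores.get)


clusters = [
    ("works perfectly! Love this product very pleased! Super easy to, "
     "I am pleased highly recommend it, high recommended!", "positive"),
    ("completely useless, return policy was very poor, it has failed "
     "it totally unacceptable, is so bad", "negative"),
]
model = train(clusters)
print(classify("love it, works perfectly", model))
print(posterior("so bad, totally useless", model))
```

Working in log space avoids numerical underflow when a long clustered text contributes many small word probabilities to the product.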


UNLABELED TEXT CLASSIFICATION: TWO-STAGE-MERGING METHOD

- Given an unlabeled text $x_j$, we use Euclidean distance to find the most similar positive cluster $C_m$ and the most similar negative cluster $C_n$.
- The sentiment of $x_j$ is estimated from the change in probability of the two clusters when each is merged with $x_j$ (vs. KNN).
- This operation is called the two-stage-merging method because each unlabeled text is merged twice.

$$f(x_j) = \begin{cases} 0, & |P(NC_m) - P(C_m)| \;\square\; |P(NC_n) - P(C_n)| \\ 1, & \text{otherwise} \end{cases}$$

Here $NC_m$ and $NC_n$ are the clusters $C_m$ and $C_n$ after merging with $x_j$; $\square$ marks the comparison operator, which is not legible in the slide.
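The procedure can be sketched as follows, under two loudly stated assumptions that the slides do not confirm: $P(C)$ is taken to be the classifier's probability of the cluster's own polarity, and the text is assigned to the polarity whose cluster probability changes least after merging. The class `NaiveBayes` and the functions `nearest` and `two_stage_merge` are hypothetical names introduced for this illustration.

```python
import math
from collections import Counter


def tokenize(text):
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing, trained on clustered texts."""

    def __init__(self, labeled_clusters):
        self.word_counts = {}
        self.label_counts = Counter()
        for text, label in labeled_clusters:
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
            self.label_counts[label] += 1
        self.vocabulary = sorted({w for c in self.word_counts.values() for w in c})

    def vector(self, text):
        counts = Counter(tokenize(text))
        return [counts[w] for w in self.vocabulary]

    def prob(self, text, label):
        """P(label | text), normalised over all labels; unseen words are skipped."""
        known = set(self.vocabulary)
        words = [w for w in tokenize(text) if w in known]
        n_total = sum(self.label_counts.values())
        logs = {}
        for s, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocabulary)
            logs[s] = math.log(self.label_counts[s] / n_total)
            logs[s] += sum(math.log((counts[w] + 1) / denom) for w in words)
        top = max(logs.values())
        weights = {s: math.exp(v - top) for s, v in logs.items()}
        return weights[label] / sum(weights.values())


def nearest(text, clusters, model):
    """Cluster with the smallest Euclidean distance to the text."""
    target = model.vector(text)
    return min(clusters, key=lambda c: euclidean(target, model.vector(c)))


def two_stage_merge(text, positive_clusters, negative_clusters, model):
    c_m = nearest(text, positive_clusters, model)   # stage 1: positive side
    c_n = nearest(text, negative_clusters, model)   # stage 2: negative side
    change_m = abs(model.prob(c_m + " " + text, "positive") - model.prob(c_m, "positive"))
    change_n = abs(model.prob(c_n + " " + text, "negative") - model.prob(c_n, "negative"))
    # ASSUMPTION: the cluster disturbed least by the merge matches the text's polarity.
    return "positive" if change_m < change_n else "negative"


positive_clusters = ["works perfectly! Love this product very pleased! Super easy to, "
                     "I am pleased highly recommend it, high recommended!"]
negative_clusters = ["completely useless, return policy was very poor, it has failed "
                     "it totally unacceptable, is so bad"]
model = NaiveBayes([(c, "positive") for c in positive_clusters] +
                   [(c, "negative") for c in negative_clusters])

print(two_stage_merge("love it, works perfectly", positive_clusters, negative_clusters, model))
print(two_stage_merge("so bad, totally useless", positive_clusters, negative_clusters, model))
```

Unlike a plain nearest-neighbour vote, the decision here depends on how the classifier's confidence in each cluster shifts once the new text is added.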


EXPERIMENT

- Dataset: Stanford Twitter Sentiment Corpus (STS)
- Baseline: bag-of-unigrams and bigrams without clustering
- Evaluation metrics: accuracy, precision, recall
- The average precision and accuracy are 1.7% and 1.3% higher than those of the baseline method.

Method      Accuracy  Precision  Recall
Our method  0.816     0.82       0.813
Bigrams     0.805     0.807      0.802
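For reference, the three metrics can be computed from gold and predicted labels as in the sketch below. The function name `evaluate` and the toy labels are illustrative assumptions and are unrelated to the STS results reported above.

```python
def evaluate(gold, predicted, positive="positive"):
    """Return (accuracy, precision, recall) with `positive` as the target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy labels, invented for the example only.
gold = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
accuracy, precision, recall = evaluate(gold, predicted)
print(accuracy, precision, recall)
```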


CONCLUSION AND FUTURE WORK

- We introduce a clustering-based method to reduce the sparsity problem in sentiment classification of short texts.
- This idea can be applied to other areas.
- The method above is only a prototype; improvements could come from the clustering algorithm, distributed representations, and the two-stage-merging method.

Future work:
- Expand the model to use the top-n similar clusters.
- Use distributed representations.
- Try deep learning models.


何云超 [email protected]

Thank you

Q&A
