[asa] sentiment analysis in twitter, a study on the saudi community
Sentiment Analysis in Twitter: A Study on the Saudi Community
Online talk by: Dr. Nora Altwairesh
Date: 11 Dec, 8:00-9:30pm
www.asa.imamu.edu.sa
Outline
• ASA
• ASA Research Group?
• Housekeeping
• The talk
Sentiment Analysis
• Keyword: iPhone
• Tweets: total tweets
• Sentiments: Pos, Neg, Neut
• iPhone is great! (Pos)
• iPhone connection sucks! (Neg)
• I bought an iPhone yesterday (Neut)
• Yeah IPhone has long battery life its even longer than my life :@ (Challenge!)
Arabic Sentiment Analysis Research Group
www.asa.imamu.edu.sa @asa__iu
Group Members
• Dr. Sarah alHumoud: Principal Investigator
• Dr. Areeb alOwisheq: Co-Investigator
• Dr. Nora alTwairesh: Senior Investigator
• Ms. Afnan alMoammar
• Ms. AlHanouf alSwilim
• Ms. Mawaheb alTowijri
• Ms. Tarfa alBuhairi
• Ms. Wejdan alOhaideb
Arabic Sentiment Analysis Group
• Create Arabic corpora
• Develop a Sentiment Analyzer web service
• Disseminate aims, findings and developed resources:
  • Website
  • Workshops
  • Scientific articles
• ASA Survey (collection, classification, analysis)
• Analyze and compare the performance of different SA methodologies
• Develop an SA classifier with discourse relations
Side Projects
• Annotation
  • 11 annotators
  • 142,434 tweets
• Tools demo
  • ASA
  • Spam detection
Coming events
• Sentiment Analysis in Social Media session in
  • HCII2017
• Publications in
  • Lecture Notes in Computer Science (LNCS)
• Deadline
  • 17 Dec 2016
Ask and talk?
• For textual questions
  • Use QA; if your question is answered it will be public
• To speak
  • Raise your hand
Attendees Countries
[Chart: attendees by country: Saudi Arabia, United Arab Emirates, Other, Oman]
Attendees Majors
[Chart: number of attendees by major: CS, IS, IT, Other, IM, DS, SE; axis from 0 to 60]
The Speaker: Nora Al-Twairesh, Ph.D.
• Assistant Professor
• Information Technology Department
• College of Computer and Information Sciences
• King Saud University
• Riyadh, Saudi Arabia
• Website: http://fac.ksu.edu.sa/twairesh
• Research Groups:
  • http://iwan.ksu.edu.sa
  • https://asa.imamu.edu.sa
• Research Interests:
  • Arabic Sentiment Analysis of Social Media text
  • Arabic Natural Language Processing
  • Web and Data Mining
Contents
• Introduction
• What is Sentiment Analysis?
• Why is it Important?
• Sentiment Analysis of Arabic
• Twitter
• Research Motivation
• Research Contributions
• Results
• Conclusion and Future Work
What is Sentiment Analysis?
• Sentiment analysis is "the field of study that analyzes people's opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes expressed in written text" (Liu, 2012).
• Different names: sentiment analysis, opinion mining, opinion extraction, sentiment mining, subjectivity analysis.
• Sentiment analysis classifies text polarity (positive, negative, neutral, and mixed).
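What is Sentiment Analysis?
Tweet and sentiment (positive, negative, neutral, mixed):
• مطار ـ الملك ـ خالد تغير إيجابي ملحوظ (King Khalid Airport: a noticeable positive change): Positive
• للأسف أثبت أنه مذيع فاشل جدا (Unfortunately, he proved to be a very bad presenter): Negative
• ممكن ترشح لي برنامج قاريء باركود ممتاز (Can you recommend an excellent barcode reader app?): Neutral
• قارئ جرير رائع لكن الاسعار غالية (Jarir Reader is great, but the prices are high): Mixed
• انصحك بالجهاز ممتاز جدا و لكن عيبه ثقيل (I recommend the device, it is excellent, but its flaw is that it is heavy): Mixed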
Why is it Important?
• The proliferation of social media websites has led to the production of vast amounts of unstructured text on the Web.
• Aggregating and evaluating these opinions manually is a tedious task and could be nearly impossible.
• These opinions are important for organizations (government, business) and for individuals.
Sentiment Analysis Methods
• Lexicon-based: rule-based method that utilizes sentiment lexicons.
• Corpus-based: supervised learning that utilizes machine learning classifiers.
Research Motivation
• Hot research field
• Challenges of Arabic language
• Challenges of Twitter data
Arabic Language
• Morphologically rich language
  • Extremely challenging to process due to rich morphology and complex word order
• Diglossic situation with a multitude of dialects
  • Modern Standard Arabic: formal language
  • Dialects: informal language
Challenges of SA of Arabic Tweets
• Use of Dialectal Arabic (DA)
• Lack of Arabic corpora and datasets
• Lack of Arabic sentiment lexicons
Why Twitter?
• Mubarak, H., and Darwish, K. "Using Twitter to collect a multi-dialectal corpus of Arabic." ANLP 2014 (2014): 1.
  • 175 M Arabic tweets during March 2014
  • 6.5 M tweets
Characteristics of Twitter Data
• Language is informal
• Short: 140 characters or less
• Abbreviations and shortenings
• Wide array of topics and large vocabulary
• Spelling mistakes and creative spellings
• Special strings: hashtags, emoticons, conjoined words
Research Contributions
• Collecting a large dataset of 2.2M Arabic tweets.
• AraSenti-Tweet Corpus: a corpus of Saudi tweets constructed from the dataset of tweets.
• AraSenti Lexicon: a sentiment lexicon of Arabic words extracted from the dataset of tweets.
• Constructing an extensive list of Arabic contextual valence shifters (negators, intensifiers, diminishers, modal words and contrast words).
• Lexicon-based method.
• Corpus-based method.
• Hybrid method.
Data Collection
• EMO-TWEET Dataset:
  • Distant supervision: emoticons are used as noisy labels (a positive emoticon marks a tweet as positive, a negative emoticon as negative).
• KEY-TWEET Dataset:
  • Sentiment words are used as search keywords, e.g. أعجبني ("I liked it") and سيء ("bad").
• Saudi-Tweet Dataset:
  • Tweet or user location is set to a Saudi location.
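The distant-supervision idea behind the EMO-TWEET dataset can be sketched in a few lines of Python. This is a minimal illustration, not the group's actual collection code: the emoticon sets and the function name `noisy_label` are assumptions, since the slides do not list which emoticons were used.

```python
from typing import Optional

# Hypothetical emoticon sets: the talk only says that emoticons
# served as noisy positive/negative labels, not which ones.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}


def noisy_label(tweet: str) -> Optional[str]:
    """Return a noisy sentiment label based on emoticons, or None."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    # No emoticon, or conflicting emoticons: leave the tweet unlabeled.
    return None


if __name__ == "__main__":
    for text in ["great day :)", "so bad :(", "hmm :) :(", "no emoticon"]:
        print(text, "->", noisy_label(text))
```

Tweets with conflicting or missing emoticons are left unlabeled here; that is a design choice of this sketch, made to keep the noisy labels as clean as possible.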
Data Collection and Preprocessing
AraSenti-Tweet Corpus
• A set of ~13,000 tweets was selected from the Saudi dataset.
• Most of the annotated tweets in the first stage were positive or negative, so the dataset needed more neutral tweets; 4,000 tweets were collected from two Saudi news accounts.
• About 2,000 more tweets were collected to set up the test set.
AraSenti-Tweet Corpus
Class No. of Tweets No. of Tokens
Positive 4,957 93,601
Negative 6,155 127,182
Neutral 4,639 71,492
Mixed 1,822 39,883
Total 17,573 332,158
AraSenti Lexicon
• AraSenti-Trans: Using MADAMIRA, the English glosses of the words extracted from the tweets were compared to English sentiment lexicons using certain heuristics, followed by manual correction.
• AraSenti-PMI: The second lexicon was generated through calculating the pointwise mutual information (PMI) measure for all words in the positive and negative datasets of tweets.
• Sentiment Score(w)=PMI(w,pos)-PMI(w,neg)
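The PMI-based score above can be sketched as follows. Writing PMI(w, pos) = log2(freq(w, pos) · N / (freq(w) · freq(pos))), the difference PMI(w, pos) − PMI(w, neg) simplifies to log2(freq(w, pos) · freq(neg) / (freq(w, neg) · freq(pos))), where freq(pos) and freq(neg) are the token totals of the positive and negative tweet sets. The function name, whitespace tokenization and add-one smoothing on the word counts are assumptions of this sketch; the slides do not say how zero counts were handled.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def pmi_scores(pos_tweets: Iterable[str],
               neg_tweets: Iterable[str],
               smoothing: float = 1.0) -> Dict[str, float]:
    """Score(w) = PMI(w, pos) - PMI(w, neg) for every word seen."""
    pos_counts = Counter(tok for t in pos_tweets for tok in t.split())
    neg_counts = Counter(tok for t in neg_tweets for tok in t.split())
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())

    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        # Smoothing (an assumption) avoids log(0) for words that
        # occur in only one of the two sets.
        f_pos = pos_counts[word] + smoothing
        f_neg = neg_counts[word] + smoothing
        scores[word] = math.log2((f_pos * neg_total) / (f_neg * pos_total))
    return scores


if __name__ == "__main__":
    toy = pmi_scores(["good good nice"], ["bad bad nice"])
    for word, score in sorted(toy.items()):
        print(f"{word}: {score:+.3f}")
```

A positive score marks a word that is more strongly associated with the positive tweets, a negative score the opposite, and the magnitude gives an intensity rather than only an orientation.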
Significance of AraSenti Lexicon
• Captures the idiosyncratic nature of social media text.
• Provides sentiment intensity of words, not only the sentiment orientation.
• MSA and DA
• High coverage: 200K words
Arabic Valence Shifters
• Extensive list of Arabic valence shifters extracted from the datasets through similarity measures.
• Negation words, intensifiers, diminishers, modal words, presuppositional and contrast words.
• Different hypotheses were evaluated for negation handling.
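Arabic Valence Shifters
Example, sentiment, and valence shifter:
• هذا الكتاب ممتع. (This book is enjoyable.) Sentiment: Positive. Valence shifter: None.
• هذا الكتاب غير ممتع. (This book is not enjoyable.) Sentiment: Negative. Valence shifter: Negation.
• هذا الرجل سيء الاخلاق لأنه أهان العامل. (This man is ill-mannered because he insulted the worker.) Sentiment: Negative. Valence shifter: None.
• لو كان الرجل سيء الاخلاق لأهان العامل. (If the man were ill-mannered, he would have insulted the worker.) Sentiment: Neutral. Valence shifter: Modal.
• هذا الكتاب جيد. (This book is good.) Sentiment: Positive. Valence shifter: None.
• ظنيت إنه بيكون كتاب جيد. (I thought it would be a good book.) Sentiment: Neutral or Negative. Valence shifter: Presuppositional.
• استطاع أن ينجح. (He managed to succeed.) Sentiment: Positive. Valence shifter: None.
• استطاع بالكاد أن ينجح. (He barely managed to succeed.) Sentiment: Negative. Valence shifter: Presuppositional.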
Sentiment Analysis Methods
• Three sentiment analysis methods:
  • Lexicon-based
  • Corpus-based
  • Hybrid
• Three classification models:
  • Two-way classification (positive, negative)
  • Three-way classification (positive, negative, neutral)
  • Four-way classification (positive, negative, neutral, mixed)
Lexicon-based Method
• Rule-based method that utilizes the AraSenti-lexicon and performs context-aware sentiment analysis by special handling of negation and contextual valence shifters.
• Calculates sentiment score which represents sentiment intensity in addition to polarity.
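A rule-based scorer of this kind might look like the sketch below. Everything in it is illustrative: the tiny lexicon, its scores, the negator and intensifier lists, and the rule that a negator flips the next sentiment word are assumptions, not the AraSenti lexicon or the negation hypothesis the study finally adopted.

```python
# Hypothetical mini-lexicon with made-up intensity scores.
LEXICON = {"ممتع": 1.0, "جيد": 1.0, "سيء": -1.0}
# Hypothetical negators and intensifier multipliers.
NEGATORS = {"غير", "لا", "ليس"}
INTENSIFIERS = {"جدا": 2.0}


def tweet_score(tweet: str) -> float:
    """Sum lexicon scores, flipping negated words and scaling intensified ones."""
    tokens = tweet.replace(".", " ").split()
    score = 0.0
    negate = False
    for i, tok in enumerate(tokens):
        if tok in NEGATORS:
            # The negation stays pending until the next sentiment word.
            negate = True
            continue
        if tok in LEXICON:
            value = LEXICON[tok]
            # An intensifier directly after the word scales its intensity.
            if i + 1 < len(tokens) and tokens[i + 1] in INTENSIFIERS:
                value *= INTENSIFIERS[tokens[i + 1]]
            if negate:
                value = -value
                negate = False
            score += value
    return score


def polarity(score: float) -> str:
    """Map a numeric score to a polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["هذا الكتاب ممتع.", "هذا الكتاب غير ممتع.", "هذا الكتاب جيد جدا"]:
        s = tweet_score(text)
        print(s, polarity(s))
```

The sign of the score gives the polarity and its magnitude the intensity, which is the property the lexicon-based method relies on.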
Corpus-based Method
• Supervised learning method that utilizes ML classifiers using the AraSenti-Tweet corpus.
• Used an SVM with a linear kernel.
• Features engineered: syntactic, semantic, and Twitter-specific.
• Semantic features include the AraSenti-lexicon.
• Performed backward feature selection to reach the best set of features.
Hybrid Method
• The approach was to incorporate the knowledge extracted from the rule-based method as features into the statistical method.
• The tweet score that is calculated in the lexicon-based method was added to the features used in the corpus-based method.
• The hybrid method exhibited significant increases in performance for two-way and three-way classification.
• However, in four-way classification the performance of the hybrid and corpus-based method was almost the same.
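The hybrid idea, adding the lexicon-based tweet score to the corpus-based feature set, can be sketched as below. The feature names, the unigram and hashtag features, and the toy scorer are assumptions for illustration; the actual feature set of the study is not listed in the slides.

```python
from collections import Counter
from typing import Callable, Dict


def corpus_features(tweet: str) -> Dict[str, float]:
    """Illustrative corpus-based features: unigram counts plus one Twitter-specific flag."""
    tokens = tweet.split()
    feats = {f"w={tok}": float(n) for tok, n in Counter(tokens).items()}
    feats["has_hashtag"] = float(any(t.startswith("#") for t in tokens))
    return feats


def hybrid_features(tweet: str,
                    lexicon_scorer: Callable[[str], float]) -> Dict[str, float]:
    """Corpus-based features plus the lexicon-based tweet score as one extra feature."""
    feats = corpus_features(tweet)
    feats["lexicon_score"] = lexicon_scorer(tweet)
    return feats


def toy_scorer(tweet: str) -> float:
    """Stand-in for the lexicon-based method (hypothetical two-word lexicon)."""
    toy_lexicon = {"good": 1.0, "bad": -1.0}
    return sum(toy_lexicon.get(tok, 0.0) for tok in tweet.split())


if __name__ == "__main__":
    print(hybrid_features("good #phone good", toy_scorer))
```

Feature dictionaries like these could then be vectorized and passed to a linear SVM, for example with scikit-learn's `DictVectorizer` and `LinearSVC`; that pairing is a suggestion, not something the talk specifies.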
Results
Model Lexicon-based Corpus-based Hybrid
Two-way classification 67.08 65.7 69.9
Three-way classification 45.69 59.85 61.63
Four-way classification 34.8 55.38 55.07
Conclusion and Future Work
• Twitter ANLP tool: Arabic language needs enabling technologies for preprocessing Twitter data.
• Other statistical methods for generating the lexicon: Chi-Square and Information Gain.
• A sentiment treebank that allows for a complete analysis of the compositional effects of sentiment in Arabic language would enable better classification.
• Better handling of negation and valence shifters through constructing a specialized corpus that contains these valence shifters and annotating them with regard to the impact on sentiment.
• Sarcasm detection in tweets is a vital research direction.
• Future solutions should be domain specific, dialect specific and periodically updated to keep up with how the language on Twitter shifts over time.
• Major and novel contributions to the field can be accomplished through collaboration among computer scientists, linguists and social scientists.
• Hence, interdisciplinary research is a major research necessity for the field to flourish and advance.
• Thank you.
• Questions?