thumbs up? sentiment classification using machine learning ... · thumbs up? sentiment...

75
Thumbs up? Sentiment Classification using Machine Learning Techniques Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79—86. Kalaiselvan Panneerselvam Course 3111: Seminar: Data Analytics I 1

Upload: others

Post on 12-Mar-2020

27 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Thumbsup?SentimentClassificationusingMachineLearningTechniques

BoPang,LillianLee,andShivakumarVaithyanathan.2002.Thumbsup?SentimentClassificationusingMachineLearningTechniques.EMNLP-2002,

79—86.

KalaiselvanPanneerselvam

Course3111:Seminar:DataAnalyticsI 1

Page 2: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Outline

• WhatisSentimentAnalysis• SentimentClassificationinMovieReviews• BaselineClassifier• SupervisedLearningProcess• Framework• FeatureExtraction• Classifiers• Results• References

Course3111:Seminar:DataAnalyticsI 2

Page 3: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Terms

• Sentiment• Athought,view,orattitude,especiallyonebasedmainlyonemotioninsteadofreason

• SentimentAnalysis• akaopinionmining• useofnaturallanguageprocessing(NLP)andcomputationaltechniquestoautomatetheextractionorclassificationofsentimentfromtypicallyunstructuredtext

Course3111:Seminar:DataAnalyticsI 3

Page 4: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

SentimentAnalysisWhat isitusedfor?

• Naturallanguageandtextprocessingtoidentifyandextractsubjectiveinformation

• Classifyingthepolarity ofagiventextaspositive,negative orneutral

• Ingeneral:todiscoverhowpeoplefeelaboutaparticulartopic

Course3111:Seminar:DataAnalyticsI 4

Page 5: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

SentimentAnalysiswho isitusedby?

• Consumerinformation• Productreviews

• Marketing• Consumerattitudes• Trends

• Politics• Politicianswanttoknowvoters’views• Voterswanttoknowpolicitians’stancesandwhoelsesupportsthem

• Social• Findlike-mindedindividualsorcommunities

Course3111:Seminar:DataAnalyticsI 5

Page 6: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

BingShopping

• a

Course3111:Seminar:DataAnalyticsI 6

Page 7: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

SentimentClassificationinMovieReviews

• Polaritydetection:• IsanIMDBmoviereviewpositiveornegative?

• Data:PolarityData2.0:• http://www.cs.cornell.edu/people/pabo/movie-review-data• Usereviewswithstarornumericalvalueastrainingandtestdata.

BoPang,LillianLee,andShivakumarVaithyanathan.2002.Thumbsup? SentimentClassification usingMachine

LearningTechniques. EMNLP-2002,79—86.

Course3111:Seminar:DataAnalyticsI 7

Page 8: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

IMDBdatainthePangandLeedatabase

when_starwars_cameoutsometwentyyearsago,theimageof travelingthroughout thestarshasbecomeacommonplace image.[…]

whenhansologoeslightspeed,thestarschangetobright lines,going towardstheviewerinlinesthatconvergeataninvisiblepoint .

cool.

_octobersky_offersamuchsimplerimage–thatofasinglewhitedot,travelinghorizontallyacrossthenight sky.[...]

“ snake eyes ” is the most aggravating kind of movie : the kind that shows so much potential then becomes unbelievably disappointing . it’s not just because this is a brian depalma film , and since he’s a great director and one who’s films are always greeted with at least some fanfare . and it’s not even because this was a film starring nicolas cage and since he gives a brauvara performance , this film is hardly worth his talents .

✓ ✗

Course3111:Seminar:DataAnalyticsI 8

Page 9: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

TheData

pInternetMovieDatabase(IMDB)archivepLimiteddatato:

nReviewswithauthorratingnPositiveandnegativereviews(noneutral)n19positive,19negativereviewsperauthor

pInterimDataset:n752negativereviewsn1301positivereviewsn144reviewersrepresented

pFinalDataset:700positive,700negative(uniformdistribution)

Course3111:Seminar:DataAnalyticsI 9

Page 10: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

PriorWork

• Priorclassificationbasedon:• source/sourcestyle• genre• knowledge-based• semanticorientationusingtextcategorization

Course3111:Seminar:DataAnalyticsI 10

Page 11: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Positiveornegativemoviereview?

• unbelievablydisappointing• Fullofzanycharactersandrichlyappliedsatire,andsomegreatplottwists

• thisisthegreatestscrewballcomedyeverfilmed• Itwaspathetic.Theworstpartaboutitwastheboxingscenes.

Course3111:Seminar:DataAnalyticsI 11

Page 12: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Baseline(HumanClassifier)

p CraftedwordlistsusingindependentCSgradstudents

p Positivevs.negativewordcount

Positive List Negative List Accuracy

Ties

Human 1

dazzling, brilliant, phenomenal, excellent, fantastic

suck, terrible, awful, unwatchable, hideous

58% 75%

Human 2

gripping, mesmerizing, riveting, spectacular, cool, awesome, thrilling, badass, excellent, moving, exciting

bad, clichéd, sucks, boring, stupid, slow

64% 39%

q Frequencycounts(including testdata)

q Hand-pickedwords

Positive List Negative List Accuracy

Ties

Human 3 + stats

love, wonderful, best, great, superb, still, beautiful

bad, worst, stupid, waste, boring, ?, !

69% 16%

*Tierates- Percentageofdocumentswherethetwosentimentswereratedequallylikely.

Course3111:Seminar:DataAnalyticsI

12

Page 13: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Thumbsup?SentimentClassificationusingMachineLearningTechniques

• Fromaboveexperiment,itisprovedworthwhiletoexplorecorpus-basedtechniques,ratherthanrelyingonpriorintuitions,toselectgoodindicatorfeaturesandtoperformsentimentclassificationingeneral.

Course3111:Seminar:DataAnalyticsI 13

Page 14: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

SupervisedLearningProcess

Course3111:Seminar:DataAnalyticsI 14

Page 15: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Framework

Movie Reviews

Develop Features

NB

ME

SVM

Extract Insights

Evaluate Results

Training Model

Course3111:Seminar:DataAnalyticsI 15

Page 16: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Framework(continued…)

Course3111:Seminar:DataAnalyticsI 16

Page 17: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

SentimentTokenizationIssues

• DealwithHTMLandXMLmarkup• Twittermark-up(names,hashtags)• Capitalization(preserveforwordsinallcaps)• Phonenumbers,datesandEmoticons• Commonlyusedtokenizer:

• ChristopherPottssentimenttokenizer• BrendanO’Connortwittertokenizer

Course3111:Seminar:DataAnalyticsI 17

Page 18: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

ExtractingFeaturesforSentimentClassification

pHowtohandlenegationnI didn’t like this movie

vsnI really like this movie

pWhichwordstouse?nOnlyadjectivesnAllwords

pAllwordsturnsouttoworkbetter,atleastonthisdata

Course3111:Seminar:DataAnalyticsI 18

Page 19: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

ExtractingFeatures(Continued…)

• Features• Unigrams:Asingleword.• Featurefrequency:frequencyofafeatureappears• Featurepresence:1onlywhenafeatureappears• Bigrams:Twocontinuesword.• PartsofSpeech:TagthewordwithitsPOS.• Adjectives:Onlyuseadjectivesinthetext.• Position:Thepositionofawordinthetext.Inthefirstquarter,lastquarterorthemiddlehalf.

Course3111:Seminar:DataAnalyticsI 19

Page 20: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Classifiers

Course3111:Seminar:DataAnalyticsI 20

Page 21: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

NaïveBayesClassifiers

Course3111:Seminar:DataAnalyticsI 21

Page 22: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

NaïveBayesClassifiers(Continued…)

Course3111:Seminar:DataAnalyticsI 22

Page 23: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

NaïveBayesClassifiers(Continued…)

NaïveBayes:

givendocumentd,theclassc*=argmaxcP(c|d).Assumeallfeaturesareconditionallyindependent

ni(d)isthenumber oftimesfioccursindocumentd.

Course3111:Seminar:DataAnalyticsI 23

Page 24: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Example

Course3111:Seminar:DataAnalyticsI 24

Page 25: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Example(Continued…)

Course3111:Seminar:DataAnalyticsI 25

Page 26: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Example(Continued…)

Course3111:Seminar:DataAnalyticsI 26

Page 27: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Course3111:Seminar:DataAnalyticsI

MaximumEntropy

ØMaximumEntropyisatechniqueforlearningprobabilitydistributionsfromdata

Ø“Don’tassumeanythingaboutyourprobabilitydistributionotherthanwhatyouhaveobserved.”

ØAlwayschoosethemostuniformdistributionsubjecttotheobservedconstraints.

27

Page 28: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

MaximumEntropy

• MaximumEntropy:

Z(d)isanormalizationfunction.Thearefeature-weightparameters.Largermeansfiisconsideredastrongindicatorforclassc.

Course3111:Seminar:DataAnalyticsI 28

Page 29: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Find a hyperplane that makes the margin between two categories.

SupportVectorMachine

Course3111:Seminar:DataAnalyticsI 29

Page 30: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Scenario1

Course3111:Seminar:DataAnalyticsI 30

Page 31: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Scenario2

Course3111:Seminar:DataAnalyticsI 31

Page 32: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Results• Resultsfordifferentfeature:

• Unigramsworksbetterthanbaseline• Presenceisbetterthanfrequency• Bigramfeaturedoesnotimproveperformance• Adjectivesarepoor• POSimproveslightforNBandME,butdeclineforSVM• Positionalsodoesnothelp

Course3111:Seminar:DataAnalyticsI 32

Page 33: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Insights

• SVMisfoundtobemoreaccurate.• Notcomparabletotopic-basedcategorizationmodels.

• Simpleunigrampresencethebest.• Presence>Frequency,notliketopic-based.

• Uncovered“thwartedexpectations”narrative.

• “Okay,I’mreallyashamedofit,butIenjoyedit.Imean,Iadmitit’sareallyawfulmovie.”

Course3111:Seminar:DataAnalyticsI 33

Page 34: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

References

p B.Pang,L.Lee, andS.Vaithyanathan,“Thumbsup?Sentimentclassificationusingmachinelearningtechniques,”inProcConfonEmpiricalMethodsinNaturalLanguageProcessing(EMNLP),pp.79–86,2002.

p EsuliA,SebastianiF.SentiWordNet:APubliclyAvailableLexicalResourceforOpinionMining.In:ProcofLREC2006- 5thConfonLanguageResourcesandEvaluation,2006.

p ZhangE,ZhangY.UCSConTREC2006BlogOpinionMining.TREC2006BlogTrack,OpinionRetrievalTask.

pDevittA,AhmadK. SentimentPolarityIdentificationinFinancialNews:ACohesion-basedApproach.ACL2007.p BoPang,LillianLee,Asentimentaleducation:sentimentanalysisusing

subjectivitysummarizationbasedonminimumcuts,Proceedingsofthe42ndAnnualMeetingonAssociationforComputationalLinguistics,p.271-

es,July21-26,2004.

Course3111:Seminar:DataAnalyticsI 34

Page 35: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Course3111:Seminar:DataAnalyticsI 35

Page 36: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

TwitterasacorpusforSentimentAnalysisandOpinionMining

AlexanderPak,PatrickParoubek

Université deParis-Sud,LREc,2010

Course3111:Seminar:DataAnalyticsI 36

Page 37: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Abstract

• Millionsofusersshareopinions onmicroblogging sites

• Richdatasourceforopinionminingandsentimentanalysis

• Focusontwittercorpuscollectionforsentimentanalysis

• Corpus isusedtobuildasentimentclassifier

• Claimsthattheproposed techniquesworkbetterthanpreviousmethods

Course3111:Seminar:DataAnalyticsI 37

Page 38: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Introduction

• Microblogging hasbecomeverypopularnowadays

• Peoplegenerallywriteabouttheirlife,shareopinions anddiscusscurrentissues

• Userspostaboutproducts&services,expresspolitical andreligiousviews

• Twitterhasenormous numberoftextpostsanditgrowseveryday

• Audiencevariesfromregular users,tocelebs topoliticians andevenpresident

• Thiscanbeusedformarketing orsocialstudies

• Authorsperformed thefollowing steps:• Collected 300,000tweets,dividedtheminto3sets:Positive,NegativeandNeutral• Performedlinguisticanalysisofcollectedcorpus• Builtasentimentclassifier• ExperimentalEvaluation

Course3111:Seminar:DataAnalyticsI 38

Page 39: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

CorpusCollection

• UsedtwitterAPI tocollectpositive,negative andobjective sentiments

• Forpositiveandnegativesentiments,usedemoticon approach:• Happy emoticons:“:-)”,“:)”,“=)”,“:D”etc. • Sad emoticons: “:-(”, “:(”, “=(”, “;(” etc.

• Forobjective posts,tweetspulledfrom44popularnewsoutletssuchas‘Washington Post’and‘NewYorkTimes’

• Assumption:Duetoshortcharacterlimit,itwasassumedthatemoticonsrepresentthesentimentfortheentiretweet

• English languagewasused forresearch,however, themethodcanbeeasilyadaptedtootherlanguages

Course3111:Seminar:DataAnalyticsI 39

Page 40: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

CorpusAnalysis

• Wordfrequencydistribution followsZipf’s law

• Thisconfirmsapropercharacteristicofthecorpus

• UseTreeTagger (Schmid 1994)forPOStagging

• Comparedtagdistributions betweensetsoftexts

• Pairwisecomparisonofeachtagdonebycalculating:

• WhereNi isthenumberofoccurences oftagTinsetI

• Comparedtwosetsofdata:i. Subjective vs.Objectiveii. Negativevs.Positive

Course3111:Seminar:DataAnalyticsI 40

Page 41: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

TreeTagger

• TreeTagger isatoolforannotating textwithpart-of-speechandlemmainformation

word pos lemma

The DT the

TreeTagger NP TreeTagger

is VBZ be

easy JJ easy

to TO to

use VB use

. SENT .

Course3111:Seminar:DataAnalyticsI 41

Page 42: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

CorpusAnalysis

Objective Subjective

Containsmorecommonandpropernous (NPS,NP,NNS)

Containsmorepersonalpronouns(PP, PP$)

Verbsareusually in3rd person(VBZ) Verbsarein1st of2nd person(VBP)

pastparticiple(VBN)used Usesimplepasttense(VBD)

Comparativeadjectives(JJR)used forstatingfacts

Superlativeadjectives (JJS) usedforexpressingemotions

Positive Negative

Superlativeadverbsandpossesiveendings (POS)

Verbsinpasttense(VBN,VBD)

Course3111:Seminar:DataAnalyticsI 42

Page 43: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

TrainingClassifier

• Useacombination ofn-gramapproachandPOStaggingforsentimentanalysis

• Experimentedwithunigrams,bigrams andtrigrams tofindbestsettingsformicroblogging data

• DataPreperation:i. Filtration:RemoveURLs,usernamesandspecialwordslike‘RT’ii. Tokenization:Splittexttocreatebagofwordsiii. RemoveStopwords:Removearticlessuchas‘a’,‘an’,‘the’frombagofwordsiv. Constructn-grams:Ensurethatnegationisattachedtothewordandconsidered asasingleword

• Usednaïvebayes classifier

wheres issentiment,M isamessage

Course3111:Seminar:DataAnalyticsI 43

Page 44: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

TrainingClassifier

• Wetrain2bayes classifiers:Onewithn-gramsasfeatures,anotherwithPOSdistribution

• Assumption:POStagsareconditionally independent ofn-grams

• WhereGisthesetofN-gramsrepresentingamessageandTisthesetofPOS-tags

• Assumption:n-gramsandPOS-tagsarealsoconditionally independent

• Forall

• Finallywetaketheloglikelihood ofeachsentiment

Course3111:Seminar:DataAnalyticsI 44

Page 45: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

IncreasingAccuracy

• Toincreaseaccuracy,wecalculateentropy ofaprobabilitydistribution

• Highentropyindicatesthatthedistribution ofsentiments indifferent sentimentdatasetsisclosetouniform

• Thesen-gramscanbediscardedastheydonotcontributemuch totheclassification

• Weintroduceanother term,salience

• Saliencetakesavaluebetween0and1.Lowvaluen-gramsshouldbediscarded

• Thefinalequation remainsthesame,exceptthefactthatwediscardn-gramswithentropyhigher thanθ orsaliencelowerthanθ

Course3111:Seminar:DataAnalyticsI 45

Page 46: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

ModelValidation

• Classifierwastestedonasetofrealhand-annotatedtwitterposts Sentiment Samplesize

PositiveNegativeNeutralTotal

1087533216

Course3111:Seminar:DataAnalyticsI 46

Page 47: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Results

• Bestresultsareachievedwhenusingbi-grams

• Bigramshavegoodbalancebetweencoverage,andabilitytocapturesentimentpatterns

• Themodelhashighaccuracy,howeverlowerdecisionvalue

• Classifierisoptimalforasentimentsearchengine

• F-measureisusedtomeasureperformance,howeverwithsomechanges:

Keepingbetaat0.5

• Increasingsamplesizeimprovesperformance,butonlyuptoacertainpoint

• Salience discriminatescommonn-gramsbetterthanentropy

Course3111:Seminar:DataAnalyticsI 47

Page 48: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Conclusion&FutureWork

• Microblogging isanattractivesourceofdataforsentimentandopinionmining

• Authorsusesyntacticstructurestodescribeemotionsorstatefacts

• SomePOS-tagsmaybestrongindicatorsofemotionaltext

• Classifier isabletodetermine positive,negativeandneutralsentiments ofdocuments

• Plan tocollectamultilingual corpusofTwitterdataandcomparethecharacteristicsofthecorpusacrossdifferentlanguages

Course3111:Seminar:DataAnalyticsI 48

Page 49: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

MyThoughts

• PartofSpeechtaggingisaneffectivewaytounderstandnuancesoflanguageandemotion

• Authorsclaimed thattheirtechniqueisbetterthanpreviousmethods, however, theynevercomparedittopreviousmethods intheentiretyofthepaper

• Although conditionalindependencycanhelpsimplifycalculations,itistoonaïveanassumption

• Another implicitassumptionisthatpeoplespellwordsproperlyon twitter.Nottrue.

• Moreover, sarcasmisoneofthebigproblemswhichcannotbehandledbysimplesentimentclassifiers.Twitter isfullofsarcasm

• Testsetistoosmall(216datapoints) comparedtotrainset(300,000datapoints!)

• Asof2018,theproposed futureworkhasnotbeenpublishedbythemainauthor

Course3111:Seminar:DataAnalyticsI 49

Page 50: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

References• G.Adda,J.Mariani,J.Lecomte,P.Paroubek,andM.Raj-man.1998.TheGRACEFrenchpart-of-speechtaggingevaluationtask.InA.Rubio,N.Gallardo,R.Castro,andA.Tejada,

editors,LREC,volumeI,pages433–441,Granada,May.

• Ethem Alpaydin.2004.IntroductiontoMachineLearn- ing (AdaptiveComputationandMachineLearning).TheMITPress.

• Hayter AnthonyJ.2007.ProbabilityandStatisticsforEn- gineers andScientists.Duxbury,Belmont,CA,USA.Kushal Dave,SteveLawrence,andDavidM.Pennock.

• 2003.Miningthepeanutgallery:opinionextractionandsemanticclassificationofproductreviews.InWWW’03:Proceedingsofthe12thinternationalconferenceonWorldWideWeb,pages519–528,NewYork,NY,USA.ACM.

• AlecGo,LeiHuang,andRicha Bhayani.2009.Twit- ter sentimentanalysis.FinalProjectsfromCS224NforSpring2008/2009atTheStanfordNaturalLanguageProcessingGroup.

• BernardJ.Jansen,MimiZhang,KateSobel,andAbdur Chowdury.2009.Micro-bloggingasonlinewordofmouthbranding.InCHIEA’09:Proceedingsofthe27thinternationalconferenceextendedabstractsonHumanfactorsincomputingsystems,pages3859–3864,NewYork,NY,USA.ACM.

• JohnD.Lafferty,AndrewMcCallum,andFernandoC.N.Pereira.2001.Conditionalrandomfields:Probabilisticmodelsforsegmentingandlabelingsequencedata.InICML’01:ProceedingsoftheEighteenthInternationalConferenceonMachineLearning,pages282–289,SanFrancisco,CA,USA.MorganKaufmannPublishersInc.

• ChristopherD.ManningandHinrich Schu ̈tze.1999.Foun- dations ofstatisticalnaturallanguageprocessing.MITPress,Cambridge,MA,USA.

• BoPangandLillianLee.2008.Opinionminingandsenti-ment analysis.Found.TrendsInf.Retr.,2(1-2):1–135.BoPang,LillianLee,andShivakumar Vaithyanathan.

• 2002.Thumbsup?sentimentclassificationusingma- chinelearningtechniques.InProceedingsoftheCon- ference onEmpiricalMethodsinNaturalLanguagePro- cessing (EMNLP),pages79–86.

• TedPedersen.2000.Asimpleapproachtobuildingen- sembles ofnaivebayesian classifiersforwordsensedis- ambiguation.InProceedingsofthe1stNorthAmericanchapteroftheAssociationforComputationalLinguis- ticsconference,pages63–69,SanFrancisco,CA,USA.MorganKaufmannPublishersInc.

• JonathonRead.2005.Usingemoticonstoreducedepen- dency inmachinelearningtechniquesforsentimentclas- sification.InACL.TheAssociationforComputerLin- guistics.

• HelmutSchmid.1994.Probabilisticpart-of-speechtag- ging usingdecisiontrees.InProceedingsoftheInter- nationalConferenceonNewMethodsinLanguagePro- cessing,pages44–49.

• ClaudeE.ShannonandWarrenWeaver.1963.AMathe-matical TheoryofCommunication.UniversityofIllinoisPress,Champaign,IL,USA.

• TheresaWilson,JanyceWiebe,andPaulHoffmann.2005.Recognizingcontextualpolarityinphrase-levelsenti-ment analysis.InHLT’05:Proceedingsofthecon- ference onHumanLanguageTechnologyandEmpiricalMethodsinNaturalLanguageProcessing,pages347–354,Morristown,NJ,USA.AssociationforComputa- tional Linguistics.

• Changhua Yang,KevinHsin-Yih Lin,andHsin-Hsi Chen.2007.Emotionclassificationusingwebblogcorpora.InWI’07:ProceedingsoftheIEEE/WIC/ACMInterna- tional ConferenceonWebIntelligence,pages275–278,Washington,DC,USA.IEEEComputerSociety.

Course3111:Seminar:DataAnalyticsI 50

Page 51: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

DEEP CONVOLUTIONAL NEURAL NETWORKS FOR

SENTIMENT ANALYSIS OF SHORT TEXTS

John Robert 278822

Seminar Data Analytics I

Course3111:Seminar:DataAnalyticsI 51

Page 52: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

OUTLINE 1. Introduction2. Convolutional neural network architecture3. Training the network4. Related works5. Advantages of CharSCNN6. Experiment7. Model setup8. Results9. Conclusions10.Advised improvement 11.References

Course3111:Seminar:DataAnalyticsI 52

Page 53: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

INTRODUCTION WhatisSentimentanalysis?SentimentAnalysisisalsoknownasopinionmining,itistheprocessofdeterminingtheemotionaltoneofaseriesofwords

Whyshorttext?Intheeraofsocialmedia,weexpressouropinionwithlimitedwords

ProblemwithshorttextsentimentanalysisShottextcontainlimitedcontextualinformation,toanalyzeshorttextitrequiresstrategiesthatcombinethesmalltextcontentwithpriorknowledgeandusemorethanjustbag-of-wordsbutalsoextractinformationfromthesentence/messageinamoredisciplinedway.

Course3111:Seminar:DataAnalyticsI 53

Page 54: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Whatstrategywasused?

• DeepconvolutionalneuralnetworknamedCharactertoSentenceConvolutionalNeuralNetwork(CharSCNN),withtwoconvolutionallayers toextractrelevantfeaturesfromwordsandsentencesofanysize.

• Twocorporaoftwodifferentdomains:theStanfordSentimentTreebank(SSTb),whichcontainssentencesfrommoviereviews;andtheStanfordTwitterSentimentcorpus(STS),whichcontainsTwittermessages.

Course3111:Seminar:DataAnalyticsI 54

Page 55: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

CONVOLUTIONAL NEURAL NETWORK

https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

INPUT - willholdtherawpixelvaluesthesequenceofwordsinthesentence.CONVOLUTIONALLAYER- willcomputetheoutputofneuronsthatareconnectedtolocalregionsintheinput,eachcomputingadotproductbetweentheirweightsandasmallregiontheyareconnectedtointheinput.RELULAYER- willapplyanelementwiseactivationfunctionPOOLLAYER-willperformadownsamplingoperationalongthespatialdimensions(width,height).FULLY-CONNECTEDLAYER- willcomputetheclassscores,whereeachofthe5predictioncorrespondtoaclassscore.CharSCNN computesascoreforeachsentimentlabelτ∈ T.

Course3111:Seminar:DataAnalyticsI 55

Page 56: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

FIRST CONVOLUTIONAL LAYER

• Firstlayerofthenetworktransformswordsintoreal-valuedfeaturevectors(embeddings)thatcapturemorphological,syntacticandsemanticinformationaboutthewords

• Weuseafixed-sizedwordvocabularyVwrd

• Weuseafixed-sizedcharactervocabularyVchr (wordsarecomposedofcharacters)

• Givenasentencesof N words{w1 , w2, …, wN }, everywordwn isconvertedtoavectorun = [rwrd , rwch]

• Eachvectorforhasasub-vector,word-levelembeddingarerwrd∈ ℝdwrd andthecharacter-levelembeddingrwch∈ ℝcl0u foreverywn

Course3111:Seminar:DataAnalyticsI 56

Page 57: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

WORD-LEVEL EMBEDDING

http://www.joshuakim.io/understanding-how-convolutional-neural-network-cnn-perform-text-classification-with-word-embeddings/

• Theyaremeanttocapturesyntactic(structure of sentences)andsemantic(relationshipbetweenwords,phrases,signsandsymbols)information

• Weconvertawordwn intoitsword-levelembeddingrwrd byusingthematrix-vectorproductrwrd = Wwrd vw

• where vw isthevectorsizeisavectorofsize|Vwrd| whichhasvalue1atindexwandzeroinallotherpositions.

• ThematrixWwrd isaparametertobelearned.

Course3111:Seminar:DataAnalyticsI 57

Page 58: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

CHARACTER-LEVEL EMBEDDING

• Usedtoextractmorphological(howwordsareformed) andshape informationofwords

• Givenawordw composedofM characters{c1, c2, ..., cM},wefirsttransformeachcharactercm intoacharacterembeddingrm

chr

• Givenacharacterc,itsembeddingrchr isobtainedbythematrix-vectorproductrchr =Wchr vc

• wherevc isavectorofsize|Vchr|whichhasvalue1atindex c andzeroinallotherpositions.Theinputfortheconvolutionallayeristhesequenceofcharacterembeddings {r1

chr ,r2chr,...,rM

chr }Convolutionalapproachtocharacter-levelfeatureextraction.

Course3111:Seminar:DataAnalyticsI 58

Page 59: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

SENTENCE-LEVEL REPRESENTATION SecondConvolutionalLayerWhydoweuseaconvolutionallayerforsentence-levelrepresentation?Becauseextractingasentence-widefeaturesethavetwomainproblems:sentenceshavedifferentsizesand;importantinformationinasentencecanappearatanypositioninthesentence

• Givenasentencex withN words{w1,w2, ...,wN},whichhavebeenconvertedtojointword-levelandcharacter-levelembedding{u1, u2, ..., uN}

• Extractasentence-levelrepresentationrsentx .

• Thislayerproduceslocalfeaturesaroundeachwordinthesentenceandthencombinesthemusingamaxoperationtocreateafixed-sizedfeaturevectorforthesentence.

Course3111:Seminar:DataAnalyticsI 59

Page 60: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

TRAINING THE NETWORK

• NetworkistrainedbyminimizinganegativelikelihoodoverthetrainingsetD.• Givenasentencex,thenetworkwithparametersetθ computesascoresθ(x)τforeachsentimentlabel τ∈ T.

• weapplyasoftmax operationoverthescoresofalltagsτ∈ T:

• Weusestochasticgradientdescent(SGD)tominimizethenegativelog-likelihoodwithrespecttoθ :

• where(x, y) correspondstoasentenceinthetrainingcorpusD andy representsitsrespectivelabel.

• CharSCNN architecturewasimplementedusingtheTheano library

Course3111:Seminar:DataAnalyticsI 60

Page 61: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

• In(Chrupala,2013),theauthorproposesasimplerecurrentnetwork(SRN)tolearncontinuousvectorrepresentationsforsequencesofcharacters,andusethemasfeaturesinaconditionalrandomfieldclassifiertosolveacharacterleveltextsegmentationandlabelingtask.

• In(Collobert etal.,2011),theauthorsuseaconvolutionalnetworkforthesemanticrolelabelingtaskwiththegoalavoidingexcessivetask-specificfeatureengineering.Theauthorsuseasimilarnetworkarchitectureforsyntacticparsing.

Course3111:Seminar:DataAnalyticsI

RELATED WORKS

61

Page 62: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

ADVANTAGES OF CharSCNN

• Itusesafeed-forwardneuralnetworkinsteadofarecursiveoneanditdoesnotneedanyinputaboutthesyntacticstructureofthesentence.

• Theadditionofoneconvolutionallayertoextractcharacterfeatures.

• CharSCNN approachtoextractcharacter-levelfeaturesisitflexibility.

Course3111:Seminar:DataAnalyticsI 62

Page 63: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

EXPERIMENTDataset• ThemoviereviewdatasetusedistherecentlyproposedStanfordSentimentTreebank(SSTb)whichincludesfinegrainedsentimentlabelsfor215,154phrasesintheparsetreesof11,855sentences.

• StanfordTwitterSentimentcorpus(STS)theoriginaltrainingsetcontains1.6milliontweetsthatwereautomaticallylabeledaspositive/negativeusingemoticonsasnoisylabels

Course3111:Seminar:DataAnalyticsI 63

Page 64: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

WhyunsupervisedLearningofWord-LevelEmbeddings?• Becauserecentworkhasshownthatlargeimprovementsintermsofmodelaccuracycanbeobtainedbyperformingunsupervisedpre-trainingofwordembeddings

Howwastheunsupervisedlearningofword-levelembeddings wasperformed?• weperformunsupervisedlearningofword-levelembeddings usingtheword2vectool3,whichimplementsthecontinuousbag-of-wordsandskip-gramarchitecturesforcomputingvectorrepresentationsofwords

Course3111:Seminar:DataAnalyticsI

UNSUPERVISED PRE-TRAINING OF WORD-LEVEL EMBEDDINGS

64

Page 65: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Datasetforpre-trainingofwordembeddings• December2013snapshotoftheEnglishWikipediacorpusasasourceofunlabeleddata

Processingthedataset1. removalofparagraphsthatarenotinEnglish2. substitutionofnon-westerncharactersforaspecialcharacter3. tokenizationofthetextusingthetokenizeravailablewiththe

StanfordPOSTagger4. removalofsentencesthatarelessthan20characterslong

(includingwhitespaces)orhavelessthan5tokens.5. welowercaseallwordsandsubstituteeachnumericaldigitbya0

(e.g.,1967becomes0000).

Course3111:Seminar:DataAnalyticsI 65

Page 66: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Process• awordmustoccuratleast10timesinordertobeincludedinthevocabulary,whichresultedinavocabularyof870,214entries

• Totrainourword-levelembeddings weuseword2vec’sskip-grammethodwithacontextwindowofsize9.

• ThetrainingtimefortheEnglishcorpusisaround1h10minusing12threadsinaIntelXeonE5-26433.30GHzmachine

Course3111:Seminar:DataAnalyticsI 66

Page 67: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

MODEL SETUP• Developmentsetsisusedtotunetheneuralnetworkhyper-parameters• Spentmoretimewasspenttuningthe learningratethantuningotherparameters,sinceitisthehyper-parameterthathasthelargestimpactinthepredictionperformance

• Theonlytwoparameterswithdifferentvaluesforthetwodatasetsarethelearningrateandthenumberofunitsintheconvolutionallayerthatextractsentencefeatures

• thenumberoftrainingepochsvariesbetweenfiveandtenforthetwodataset

Course3111:Seminar:DataAnalyticsI 67

Page 68: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Neural Network Hyper-Parameters• weuse4threadsinaIntelXeonE5-26433.30GHzmachine.• Theano basedimplementationofCharSCNN takesaround10min.tocompleteonetrainingepochfortheSSTb corpuswithallphrasesandfiveclasses.

Course3111:Seminar:DataAnalyticsI 68

Page 69: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

RESULTS FOR SSTB CORPUS

• wecheckwhetherusingexamplesthataresinglephrases,inadditiontocompletesentences,canprovideusefulinformationfortraining

• Phrasesindicateswhetherallphrases(yes)oronlycompletesentences(no)inthecorpusareusedfortraining.

• TheFine-Grainedcolumncontainspredictionresultsforthecasewhere5sentimentclasses(labels)areused(verynegative,negative,neutral,positive,verypositive).

Course3111:Seminar:DataAnalyticsI 69

Page 70: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

NOTE

• CharSCNN andSCNNhaveverysimilarresultsinbothfine-grainedandbinarysentimentprediction.Theseresultssuggestthatthecharacter-levelinformationisnotmuchhelpfulforsentimentpredictionintheSSTbcorpus.

• Usingphrasesastrainingexamplesallowstheclassifiertolearnmorecomplexphenomena,sincesentimentlabeledphrasesgivetheinformationofhowwords(phrases)combinetoformthesentimentofphrases(sentences).

• ComparedtoRNTN,CharSCNN hastheadvantageofnotneedingtheoutputofasyntacticparserwhenperformingsentimentprediction

Course3111:Seminar:DataAnalyticsI 70

Page 71: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

RESULTS FOR STS CORPUS

Course3111:Seminar:DataAnalyticsI

• character-levelinformationhasagreaterimpactforTwitterdata• Usingunsupervisedpre-training,CharSCNN providesanabsolute

accuracyimprovementof1.2overSCNN

71

Page 72: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

CONCLUSIONS

Course3111:Seminar:DataAnalyticsI

• positivesentence(left)anditsnegation(right).• theextractedfeaturesconcentratemainlyaroundthemaintopic,“film”,and

thepartofthephrasethatindicatessentiment(“liked”and“did’nt like”).• leftchartthattheword“liked”hasabigimpactinthesetofextractedfeatures.• rightchart,wecanseethattheimpactoftheword“like’’isreducedbecauseof

thenegation“did’nt”,whichisresponsibleforalargepartoftheextractedfeatures.

72

Page 73: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

Course3111:Seminar:DataAnalyticsI

• negativeexpression“incrediblydull”isresponsibleforthefeaturesextractedfromthesentenceintheleft

• “definitelynotdull”,whichissomewhatmorepositive,isresponsibleforthefeaturesextractedfromthesentenceinthechartatright.

Whydoweusenegation?• negationisanimportantissueinsentimentanalysis

73

Page 74: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

ADVISED IMPROVEMENT

• Usethreeconvolutionallayer,firstlayerforcharacterlevelembedding,secondforwordlevelembeddingandthirdforsentencelevelembedding

• Performunsupervisedpre-trainingatcharacter-levelrepresentations

Course3111:Seminar:DataAnalyticsI 74

Page 75: Thumbs up? Sentiment Classification using Machine Learning ... · Thumbs up? Sentiment Classification using Machine Learning Techniques • From above experiment, it is proved worthwhile

REFERENCES• https://www.brandwatch.com/blog/understanding-sentiment-analysis/• https://deeplearning4j.org/word2vec.html• http://cs231n.github.io/convolutional-networks/• https://www.clickworker.com/2017/03/14/sentiment-analysis-what-is-it-for/• https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/• https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/

Course3111:Seminar:DataAnalyticsI 75