research using twitter social media...
TRANSCRIPT
1
ResearchUsingTwitterSocialMediaDataTwitterisanextensivesourceofsocialmediadata,andunusualinthatithasbothanextensivedatasharinginfrastructureandanexplicitlypermissiveTermsofService.ThistutorialdocumentshowshowtosocialscientistscanuseTwitterasasourceofresearchdata,withoutanytechnicalbackgroundorprogrammingability.Itshowshowtousecommonlyavailabledatacaptureandanalysistools,andthekindsofresearchmethodsandinvestigationsthattheycansupport.NoteonSoftwareRequiredforthisTutorialTofollowthistutorial,youwillneedtousetheChromebrowsertoaccessTwitter’swebsite.Thedatacapturetool(WebDataRA)isaChromebrowserextensionthatinterpretstheTwitterWebpagesandcapturesthemasspreadsheetdatathatyoucanpasteintoMicrosoftExcel(WindowsorMac).YouwillalsouseGephi,anopensourcenetworkanalysisandvisualisationapplicationwhichyoucandownloadandinstallfromgephi.org.ThisexpectsarecentversionoftheJavaRuntimeEnvironment(JRE)tobeinstalledonyourmachine.
InstallingandUsingWebDataRATheWebDataRAwillcaptureTwitter,FacebookandGoogledatafromabrowserandallowyoutopasteatableofinformationdirectlyintoaspreadsheet.ThisdocumentfocusesonitsusewithTwitter.1) InstalltheWebDataRAbrowserextensionintoChromebyvisiting
bit.ly/WebDataRAinChrome,andclickingontheblue“+AddtoChrome”button.Thesmallgreenicon willappearinthetoprightofthebrowserwindow,nexttotheURLbar.
2) GotoTwitter.comandcreateaTwittersearchordisplayatimeline3) ClickontheWebDataRAicon tostartcollectingtweets.Everyfive
secondsthebrowserwillautomaticallyscrolltothebottomofthepagetomakeTwitterloadthenextbatchofresultsandcopythedataautomaticallytotheclipboard.
4) Whenyouhavecollectedenoughresults,pastethedataintoanExcelspreadsheet.5) UseExceltoanalysedata,orexporttootherprogramssuchasGephiorVoyantforother
kindsofanalysis.
2
OverviewWhenyoupastedatafromWebDataRAyouwillcreatefourtablesinyourspreadsheet,appearingbeloweachother.
Thetweetdata,withauthor,mentions,hashtags,textandcountsofretweets,repliesandlikesbrokenoutinseparatecolumns.
Accountoccurrencesummary,acountofthenumberoftimesthateachTwitteraccountappearsinthedatasetasauthororamention(includingthenumberofretweets).
Countsoftheappearancesofeachhashtag.
Atableofedgesoftheconversationalnetwork,i.e.thenumberoftimeseachpairofaccountscommunicatewitheachother.
Youcanusethisdatainvariousways:(a) Thetweetdata(gray)containsthebasicdataabouteachtweet:whatwassaid,
when,bywhoandtowhom.Youcanusethisdatatoformageneraloverviewofthecommunicationovertimeandidentifythemostsignificanttweets.YoucanalsoexaminespecifictweetsandtheircontextbyreferringbacktotheTwittersiteusingeachtweet’sURL.
(b) Theaccounttable(green)showsyouthemostactivetweeters,themostfrequentrepliers,andthemostretweetedusers.Thiswillhelpyouseethekeyactorsinaconversation,andthemainrolesthattheytake.Youcanfollowupbyclickingontheaccountnamestolookattheaccountbiosandtherelevanttimelinesoftheseactorstounderstandwhethertheyarecorporateaccounts,privateindividuals,botsortrolls.
(c) Thehashtagtable(blue)showsyouthemostfrequentlyusedhashtags.Thiscanhelpyouextendyourdatagatheringtolookformoretweetsrelevanttoyourresearchquestion.
(d) Theedgetablewillhelpyoutoseetheinteractionsbetweenactors,andhelpyoutounderstandgroupingsofactors,andthepatternoftheirinteraction.Isakeyaccountdominatingaconversationandtalkingtomanyothers?Aretheyrespondingorjustbeingpassiverecipientsofmarketingmessages?Isthereagroupofequalshavingabalancedconversationwithequalparticipation?
3
UsingExcelforSimpleQuantitativeAnalysesThefollowingsetoftweets(graytable)comesfromaTwittersearchforthephrase“DigitalDetox”.
TheeasiestwaytoseeanoverviewofaTwittertimelineistocreateaPivotTable.Clickonanygraycell,andchoose“PivotTable”fromtheInsertribbon.InthePivotTablebuilder,drag“Author”fromtheFieldNamepanelintothe“Rows”panel,drag“Timestamp”intothe“Columns”panel,anddrag“Author”(again)intothe“Values”panel(itwillautomaticallyturninto“CountofAuthor”).
Thescreenshotaboveshowsthedatesincludedinthetwittersampleasgreencolumnheadings,andtheaccountsthatauthoredtweetsasrowlabelsalongthelefthandside,orderedbymostprolifictweeters.Thevaluesineachcellarethenumberoftweetsauthoredbyaspecificaccountonaspecificday.Youcanadjusttheformattingforconvenience(Inarrowedthecolumnsandslantedthecolumnheadingsandchangedtheangleofthetextto60degreestofit),usethe“Row
4
Labels”controltosortbytheauthorcount(i.e.thenumberoftweetsanauthorcreated)andshowonlytherowswherethetotalauthorcountisgreaterthanachosenthreshold.Youcanalsouseconditionalformattingtocolourthecellstohighlightthemostextremevalues.AllkindsofsummariesandanalysesarepossibleusingExcelonthisdata,including:
• Showingthedistributionofthetweetsamplethroughtime• Identifyingthemostprolificand/orpopularactors,andshowingtheiractivity
throughtime• Showingtheuseofindividualhashtags(thismightbeusefulinabigconversation,or
onethatevolvesoveralongerperiod)• Comparingtherelativeproportionofcontributionsfromdifferentactors/hashtags
AlloftheseanalyseswillleadontootherquestionsthatcanbeaskedbygoingbacktoTwitter.Theaccountnamesintheaccount“authorandmentions”(green)tableareclickable,andopenthepageoftheaccountprofileinyourdefaultwebbrower.Alternatively,tomakeasetofaccountnamesintoclickablehyperlinksgivingbrowseraccesstotheuser’sTwittertimelineandbio,usethefollowingfunction:=HYPERLINK(CONCATENATE("http://twitter.com/",A9),CONCATENATE("@",A9)) whereA9isanexampleofacelladdressthatcontainsaTwitteraccountnamee.g.lescarr.Followingtheaccounthyperlinksforthemostprolificauthorsinthegreentable,weseethattheyareallcommercialactorstooneextentoranother.Account # BioItsTimeToLogOff 30 TimeToLogOffisthehomeofdigitaldetox.We’respearheadingthe
movementtodisconnectregularlyfromdigitaldevicesandreconnectwiththeworldoffline.Wedothisthroughcollectingfactsontheneedfordigitaldetox,runningcampaignstogeteveryoneofftheirscreensandhostingretreats,eventsandworkshops.
DinnerTableMBA 9 Acommercialorganisationworkingtogethertohelpfamiliesbecomemoreconfident,successful,andself-empowered
SpareFoot 8 Astoragecompany.Wemakeiteasytomoveandstoreyourstuff.Reservestorageforfreeandgetyourmindoutoftheclutter.
CultureEffect 5 AuthorofDigitox:HowtoFindaHealthyBalanceforyourFamily’sDigitalDiet
UsingGephiforNetworkAnalysesToseehowthevariousaccountsinteractwitheachotherasanetwork,copyandpastetheyellowtableintoaseparatespreadsheetandsaveitasaCSVfile(callitedgetable.csvor
3.OneclicktoopenTwitteraccountinbrowser
2.Functionturnsaccountnameintolink
1.Functionusesaccountname
5
similar).Loadupthenetworkvisualisationprogram“Gephi”,andstartanewproject.Inthe“DataLaboratory”,choose“ImportSpreadsheet”andloaduptheCSVdataasanedgetable.Youcanthenapplyavarietyofnetworklayoutalgorithmsinthe“Overview”pane.
Thenetworkvisualisationshowsmanyisolatednodes(accounts)inanouterringandacentralcoremadeupofdifferentgroupsofaccounts.Manyofthesearelooselyconnected“chains”of2-6accountswhereoneaccounthasmentionedanother,whichhasmentionedanotherandsoon.Therearemorecomplicatedsubnetworkcomponentsthatdemonstratemoreactivity,asseenbelow.
Thegreencomponentisdominatedbyasinglecorporateaccount(themostprolificaccountinthissample)whoseroleistopromotetheideaofadigitaldetoxandthat“tweetsat”manyotheraccounts,initiatingcommunicationwiththem.Bycontrast,therednetworkconsistsofalargergroupofteachersandeducationprofessionalswhoalreadyparticipateinalargerprofessionalnetworkwithinTwitter,andwhoarediscussingthetopicofdigitaldetoxwithinthatcontext.ManysummariesandanalysesarepossibleusingGephi’snetworkvisualisationtool:
• Showingtheinteractionofthenetworkactors• Identifyingthecommunitiesandactiveparticipantsubgroupswithinthelarger
sample• Identifyingtherolesofdifferentactorsinthecommunicationsnetwork
6
Pleasenote,thatmanynetworkanalysesintheliteratureoftenfocusontheretweetnetwork,butthatisnotpossiblewithWebDataRA,becausetheidentityoftheretweetingaccountsisnotavailabletotheWebbrowser(theWebUserinterfaceorWebapp).Youhavedataonthenumberofretweets(i.e.thepopularityofthetweet),butnottheaccountsthatretweetedtheoriginalmessage.Formoreinformationonasupplementaryservicebeingdevelopedtofillthisgap,seetheendofthisdocument.
UsingVoyantforTextualAnalysesInthegraytable,copythe“SanitisedText”column.Thiscontainsthetextofallthetexts,butwithalltheTwitterfeatures(@names,#hashtags,URLs)removedtoleaveonlytheEnglishtext.GototheVoyant-Tools.orgwebsite,pastethetextintothetextboxandpressthe“Reveal”button.Youwillseeascreenwithseveralpanelsthathelpyouexplorethetextofthetweetsindifferentways.VoyantToolsisatextualcorpusanalyser.Itconsidersthedatathatyouhaveenteredasasingledocumentwheretheindividualtweetsarelikeindividualsentencesorparagraph.TheWordcloud,Reader,TrendandConcordanceallanalysethetextfromthecollectionoftweets.ClickonawordintheWordcloud,andallitsoccurrenceswillbehighlightedintheReaderpanel,it’sfrequencythroughoutthewholedocument(setoftweets)willbedisplayedintheTrendgraph,anditscontextwillbedisplayedintheConcordance.Thishelpsyoutoinvestigatetheuseoflanguageinthecollectionoftweets,andquicklyunderstandwhatisbeingtalkedaboutandhow.Italsohelpsyoutoseehowthelanguagechangesovertime.Thefirstthingthatthisdisplayshowsisthatthemostcommontermsaredigital,detox,anddigitaldetoxbecausetheywerethesearchterms!Toignorethem,addthemtothestopwordslistbyclickingonthe“Options”iconintheWordcloudpanel’sgreyiconbar.
Whenthemouseentersthebar,youwillseethefollowingiconsappear: .The“Options”iconistheslidingbutton,nexttothe“Help”questionmark.
ToedittheStopwordslist,clickonthe“EditList”buttonandaddthethreeextratermstobeignored,oneperline.
7
Youwillseethatthepanelshavere-analysedthetext,ignoringthetermsthatyouhaveaddedtothestopwordlist.
Othermorecomplexvisualisationsandanalysesareavailabletobeused,includingadvancedMachineLearningalgorithmstoclusterkeywordsandsimplegraphvisualisationstoshowkeywordco-occurrence.
StreamGraph DimensionalReductionsofKeywordAppearance
NetworkofKeywordCo-occurrences
8
AnalysisofTwitterLanguageThewordcloudshowsinvisualformthemostcommonlyusedtermsintheTwittersample.Thisisgreatforanimpressionofthetopics,butamoreusefulsummaryisthe“Terms”panel.Using15ofthetop20wordsinthissample,wemighthypothesisethatadigitaldetoxisaboutspendinglesstimeusingaphoneandtheneedtounplugyourselffromsmartphonesocialmediatechnologyapps–it’sahealthchallengeyoutryforaday,aweekorayear.
Aneasywaytoinvestigatetheuseofeachofthesewordsistoselecttheminthetermpanel,andscrollthroughthein-contextuseinthe“Context”panel.Themostcommonlyusedwordis“time”butit’smainuseisnotinthesenseof“spendingtoomuchtime”or“savingtime”butinthesenseofanopportunity(it’stimeto…orit’stimefor…)andfrequentlyasarhetoricalquestion(Isn’tittimeforadigitaldetox?)FindingtweetsthatspeakfrompersonalexperienceManyofthetweetsseemtobepromotionallifestyletweetsinheadlineform(e.g.“WhyYouShouldDoaDigitalDetox”or“DigitalDetoxBenefits:Areyouaddictedtotechnology?”)Toidentifypersonaltweetsfrompeoplewhohavetriedorarethinkingoftryingadigitaldetox,searchfortweetscontainingthepersonalpronoun“I”.Unfortunately,“I”isoneoftheVoyantToolsstoplistwordsandisignored(astoplistconsistsofthecommonbutlow-informationwordsinatextincludingpronouns,prepositionsandconjunctions).AlthoughitispossibletoeditthestoplistinVoyant,IwillsearchforthestringI<space>orI<apostrophe>inthetextofthetweet(tomatchI,Iam,Ihave,Iwill,I’m,I’ve,I’ll).Usethefollowingformula =OR(NOT(ISERROR(FIND("I ",A1))),NOT(ISERROR(FIND("I'",A1)))) todefineanextracolumninthegraytable,andsortthesetbytheresults(TRUEorFALSE).Only17%ofthetweetsinthiscollectionwereidentifiedaspersonalbythissimplerule.(Note,noteverytweetwiththewordIistalkingfromafirstpersonperspective,andmanytweeterswouldmiss“Ihave”or“Iam”fromthebeginningofatweetortext.)Ofthose17%,justoverhalfwereactivelyplanningadigitaldetox,weredoingadigitaldetoxorhaddoneone.(Theironyoftweetingaboutdoingadigitaldetoxisnotlostonsomecommentators!)
9
Itisthenpossibletofollowupwitheachoftheaccountsthathastweetedaboutanactualperiodofdigitaldetoxthattheyhaveundertaken,toexaminetheirTwitteractivitybefore,duringandafterthedetoxandtoidentifyanyquantitativedifferenceintheiruseoftheplatform.Asanexample,Ichoseasingleaccountthathadannouncedthattheywerestartinga“DigitalDetox”todeprivethemselvesofsocialmediaforaday.Iaccessedallthetweetsofthatuserovertheperiod(usingWebDataRA),andcreateda1-authorpivottable(asabove)toaggregatethetweetstoacountperday,andthenplottedthatdataasascattergraphwith7-daymovingaverage.Thegraphshowsthatalthoughtheaccountdidnottweetonthechosendate(markedwitharedcross)thereisnonoticeablechangeintweetingbehaviouraftertheannounceddetoxday.
AdditionalMethods:SentimentAnalysisSentimentanalysiscanhelpyouidentifypositiveornegativecommentsinyoursample.Thisisapopularmethodinindustry,especiallywithbrandmanagementcompanies.Howeveritisacademicallycontested,anddoesnothaveahighdegreeoftransparencyinthelexicalprocessing. Nevertheless,totryitout,pastethe“SantitisedText”columnintoanonlineservicesuchassentigem.com.Considertowhatextenttheresultsseemaccuratetoyou–howwelldoesitidentifypositiveandnegative‘sentiment’inasentence?Whatkindsofinaccuraciescanyousee?And,mostimportantly,despiteanyshortcomings,doesithelpyoutoidentifyanypointsofinterestinyourdataformorethoroughinvestigation?
7% 11%
22%
23%
25%
12%
PersonalTweetsaboutDigitalDetox
WantToDo
PlanToDo
Doing
Did
Suggestions
Opinions
10
AdditionalMethods:RetweetNetworkAsupportingserviceisbeingdevelopedforWebDataRAtoallowyoutorecovertheretweetnetworkaroundthetweetsthatyouhavecollected.Aspreviouslyexplained,theTwitterwebappdoesnotshowtheretweetsofeachtweet,onlythecountofretweets.However,itispossibletorequestthatdatafromtheTwitterAPI,althoughitislimitedto100retweetsofeachtweet.(Ifatweethasbeenretweetedmorethan100times,thoseextratweetswillnotbeaccessible,assuchthisshouldbeconsideredtobeanapproximationoftherealretweetnetwork.)Toobtaintheretweetdataforyourtweets,selectthetwitteridsforwhichyouwishtofindtheretweets,andpastetheseintotheformathttp://pretend.webdataRA/retweets/.PleasenotethattheTwitterAPIlimitsrequestssuchthateachtwitterIDthatyouprovidewilltake12secondstoprocess,andthefinishedresultmaytakeanhourormore.YoucanpastetheresultingdataintoanExcelspreadsheet,saveitasCSVandimportitintoGephiaspreviouslyexplained.IntheGephinetworkvisualisationbelow,nodesarelargerandcolouredmoreintenselyaccordingtohowmuchothernodesretweetthem(theirroleasauthoritiesinthenetwork),butthenodelabelsaresizedaccordingtowhethertheyretweetorareretweeted.Soyoucanseethatsomeaccountsareveryactivebutnotasoriginatorsofinformation,whereasotheraccountsmaybelessactivebutaremoreinfluential.
11