
Computationally Analyzing Social Media Text for Topics: A Primer for Advertising Researchers

Joseph T. Yun a,b, Brittany R. L. Duff b, Patrick T. Vargas b, Hari Sundaram c, and Itai Himelboim d

a Department of Accountancy, Gies College of Business, University of Illinois at Urbana–Champaign, Champaign, Illinois, USA; b Charles H. Sandage Department of Advertising, College of Media, University of Illinois at Urbana–Champaign, Champaign, Illinois, USA; c Department of Computer Science, Grainger College of Engineering, University of Illinois at Urbana–Champaign, Champaign, Illinois, USA; d Department of Advertising and Public Relations, Grady College of Journalism and Mass Communication, University of Georgia, Athens, Georgia, USA

ABSTRACT
Advertising researchers and practitioners are increasingly using social media analytics (SMA), but focused overviews that explain how to use various SMA techniques are scarce. We focus on how researchers and practitioners can computationally analyze topics of conversation in social media posts, compare each to a human-coded topic analysis of a brand's Twitter feed, and provide recommendations on how to assess and choose which computational methods to use. The computational methodologies that we survey in this article are text preprocessed summarization, phrase mining, topic modeling, supervised machine learning for text classification, and semantic topic tagging.

KEYWORDS
Social media analytics; topic discovery; topic modeling; computational advertising; machine learning

Social media analytics (SMA) is the process of using computational methods and tools to extract insights from social media data (Fan and Gordon 2013) as well as measure the performance of social media campaigns (Murdough 2009). SMA is widely used by companies and academia to measure and analyze consumers' reactions to advertising phenomena such as brand promotions and new trends (e.g., Daniel, Crawford Jackson, and Westerman 2018; Fulgoni 2015; Kwon 2019; Moe, Netzer, and Schweidel 2017; Rosenkrans and Myers 2018; Yun and Duff 2017). The growth of SMA parallels the growth of social media use, as a recent Pew Research Center survey showed that 75% of adults in the United States use social media, and this percentage grows to 94% for 18- to 24-year-olds (Smith and Anderson 2018).

While public social media posting creates a large set of data for advertising researchers and practitioners, the use of these data has been largely constrained to those with more technical knowledge in computation and algorithms or those with access to professional tools that may be cost-prohibitive. While there have been several recent academic papers that help researchers discern how to go about analyzing computational data for marketing and advertising research, most of these papers function as high-level overviews (e.g., Humphreys and Wang 2018; Liu, Singh, and Srinivasan 2016; Malthouse and Li 2017; Murdough 2009; Rathore, Kar, and Ilavarasan 2017) without giving details on how to actually go about conducting analyses. More focused "how-to" articles on specific computational methods for research are limited, especially how to detect topics of conversation in social media text for the purposes of advertising. As far as we know, there are no articles detailing what to consider when one wants to detect topics of

CONTACT Joseph T. Yun [email protected] Department of Accountancy, Gies College of Business, University of Illinois at Urbana–Champaign, 2046 Business Instructional Facility, 515 East Gregory Drive, Champaign, IL 61820, USA.
Joseph T. Yun (PhD, University of Illinois at Urbana–Champaign) is a research assistant professor of accountancy and director of the Data Science Research Service in the Gies College of Business, University of Illinois at Urbana–Champaign.
Brittany R. L. Duff (PhD, University of Minnesota) is an associate professor of advertising in the Charles H. Sandage Department of Advertising, College of Media, University of Illinois at Urbana–Champaign.
Patrick T. Vargas (PhD, Ohio State University) is a professor of advertising in the Charles H. Sandage Department of Advertising, College of Media, University of Illinois at Urbana–Champaign.
Hari Sundaram (PhD, Columbia University) is an associate professor of computer science and advertising in the Charles H. Sandage Department of Advertising, College of Media, University of Illinois at Urbana–Champaign.
Itai Himelboim (PhD, University of Minnesota) is the Thomas C. Dowden Professor of Media Analytics, director of the Social Media Engagement and Evaluation Suite, and an associate professor of advertising in the Grady College of Journalism and Mass Communication, University of Georgia.
© 2019 American Academy of Advertising

JOURNAL OF INTERACTIVE ADVERTISING
https://doi.org/10.1080/15252019.2019.1700851


conversation from social media text for advertising. This is an important area of investigation because computational advertising is the primary revenue stream for companies like Google (Schomer 2019), and one of the core analytics methods used in computational advertising is computational text analysis of topics (Soriano, Au, and Banks 2013).

The focus of this article is to provide consideration to advertising researchers when computationally detecting topics of conversation from social media text. First, we discuss and compare several ways to computationally detect topics of conversation in social media text: text preprocessed summarization, phrase mining, topic modeling, supervised machine-learned text classification, and semantic topic tagging. Second, we present the results of comparing the various methods against a human-coded Twitter data set in which coders identified topics of conversation via a human content analysis approach. Finally, we provide a reference matrix that outlines a summary of each method, when to consider using that method, and tools to execute that method.1

Topic Discovery in Mass Communication Research, Consumer Research, and Advertising Research

There has been previous research regarding topic discovery in the areas of mass communication research, consumer research, and advertising research. Humphreys and Wang (2018) recently outlined research steps for incorporating automated text analysis into consumer research. In their breakdown of approaches, they discuss topic discovery but focus only on unsupervised learning (topic modeling). We broaden the possibilities of topic discovery within our article to include text summarization (counting) as well as supervised learning. In their overview of journalism-focused computational research, Boumans and Trilling (2016) suggested that to see more computational research produced, researchers should build custom-made tools and share not only their results but also their code. Our focus digs more deeply into how to detect topics of conversation within social media text. In addition, we provide the links to our code for those researchers who have more advanced computational skills, and we went a step further and built the methods we describe into an open-source tool called the Social Media Macroscope (SMM), which is available to all researchers, so that those with less of a technical background can use these methods for their research.

Numerous advertising and marketing studies have incorporated automated topic detection in their research. Liu, Burns, and Hou (2017) applied latent Dirichlet allocation (LDA) topic modeling to tweets to understand what topics brands were discussing overall. They separated the brand-generated tweets into positive and negative, and applied LDA topic modeling to the positive tweets and negative tweets to understand if different topics were being discussed. Liu (2019) applied text preprocessing and LDA topic modeling in her efforts to discover a relationship between tweets and financial stock prices using machine learning. Vermeer et al. (2019) analyzed Facebook brand pages and used supervised learning to detect whether consumer posts fell into the following categories: rejection, complaint, comment, question, suggestion, acknowledgment, and/or compliment. Although these categories are a bit different than detecting the topic a consumer may be discussing, the process they used to categorize the posts (supervised learning) is the same process through which myriad topics could be detected. Okazaki et al. (2015) conducted a similar study using supervised learning to understand how consumers were discussing IKEA (a home furnishing brand) on Twitter and divided consumer conversations into the categories of sharing, information, opinion, question, reply, and exclude. Liu, Singh, and Srinivasan (2016) used principal component analysis (PCA) to summarize what topics were being discussed on Twitter by consumers of television programming. PCA is like LDA topic modeling in the sense that it provides groups of words that statistically belong in the same groupings as each other, but PCA ignores class labels as compared to LDA. We focus on LDA topic modeling in this article due to its popularity over its PCA counterpart, but interested readers should see Martínez and Kak (2001) for more information about the pros and cons of LDA versus PCA.

Recently, Malthouse and Li (2017) suggested that big data analyses in advertising research could uncover "exploratory insights to create better messages and motivate new theories … [and] can provide insights on what consumers think and feel about a brand" (p. 230). Our article directly builds from Malthouse and Li (2017), as we provide instructions on how researchers can conduct analyses of topics of conversation within social media text to understand what consumers think and feel about a brand. Taking this even further, topics of conversation can be analyzed to understand the profile of an audience more accurately, and this profile could be used to better



create advertising messages (Kietzmann, Paschen, and Treen 2018).

Detecting Topics of Conversation in Social Media Text

Here we briefly discuss what constitutes a topic of conversation computationally. For the purposes of this article, we adhere to the definition of a conversational topic as "short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments" (Blei, Ng, and Jordan 2003). In addition, the realm of social media is vast, making "social media data" a very broad term. Therefore, for the purposes of this article, we limit our scope of social media data to short form/microblogged text, such as status updates (e.g., the text of Twitter, Facebook, Instagram, and Reddit posts).

Our list of methodologies is not meant to be exhaustive, but they should give advertising researchers who are newer to social analytics a satisfactory starting place for engaging in detecting topics. We provide a high-level summary of five methods of detecting topics: text preprocessed summarization, phrase mining, topic modeling, supervised machine-learned text classification, and semantic topic tagging. Some of these methods are subsets of another (e.g., semantic topic tagging is a subset of supervised machine-learned text classification), but we have pulled each of them out individually due to special considerations that we outline in each respective section.

Text Preprocessed Summarization

Text preprocessing can be defined as bringing "your text into a form that is predictable and analyzable for your task" (Ganesan 2019). It is almost always the first step when preparing text for advanced analytics techniques such as machine learning, but it can also be used on its own to understand what is being discussed within a corpus of text. Using text preprocessing to summarize text could be considered a naive approach to topic detection in social media text, but it is almost always a proper starting place to understand what is being talked about within a grouping of social media posts. Our previously stated definition of "topic" suggests that short summarized descriptions of a text to help understand the larger whole qualify as topic detection. Text preprocessed summarization uses words that occur within social media posts as proxies for topics.

In their book on text mining, Miner et al. (2012) outline the general steps of preprocessing text: choose the scope of text to be processed, tokenize, remove stop words, stem, normalize spelling, detect sentence boundaries, and normalize casing. Depending on the context of what they are trying to extract from text, researchers can include/exclude some of these general steps according to their research objective. The steps that we followed to analyze the social media posts for this article are as follows:

1. Collected social media posts. Although a specific how-to on collecting and aggregating social media posts is beyond the scope of this article, researchers who have a limited background in this area could consider some existing tools that do not require extensive programming to collect social media data (Smith et al. 2009; Yun et al. 2019). We pulled the tweets for our analyses using the Tweepy Python package (Roesslein 2009), but some alternative considerations would be the twitteR R package (Gentry 2015) or using SMM. SMM leverages the Tweepy Python package.

2. Tokenized the text from the social media posts. Tokenization is the process of reducing text into smaller chunks (tokens). We suggest using white-space tokenization to separate social media posts into separate words. For example, a tweet may read "I love Fridays!" White-space tokenization, combined with the normalization described in the next step, would reduce the tweet to "i" and "love" and "fridays." Other more nuanced forms of tokenization exist (e.g., Hochreiter and Schmidhuber 1997; Trim 2013), but we focus on the most common technique for this article. We used Python's NLTK (Natural Language Toolkit) package (Loper and Bird 2002) for this step, as well as for steps 3 through 5.

3. Normalized the text from social media posts. Normalization involves tasks such as converting all text from uppercase to lowercase, removing numbers (unless numbers mean something within the context of the topical analyses), and removing punctuation.

4. Removed stop words from social media posts. Stop words are the most common words that appear in a text but are not actually important to understanding the topical content of what is being discussed (e.g., the, with, to, a, and).



5. Stemmed all the remaining words within social media posts. Stemming is the process of reducing words to their word stems. For example, the words walking, walked, and walk would all stem to the word walk.

6. Sorted the remaining words in descending order of frequency. Sorting in this way will identify the most important words (topics) within the body of posts.
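The six steps above can be sketched in a few lines of Python. This is a minimal, illustrative stand-in, not the pipeline used for our analyses: our code relies on NLTK's tokenizer, stopword corpus, and stemmer, whereas the tiny stopword set and crude suffix-stripping stemmer below are simplified placeholders.

```python
import re
from collections import Counter

# Minimal stand-in stopword list; in practice use NLTK's English stopword corpus.
STOP_WORDS = {"the", "with", "to", "a", "and", "i", "on", "of", "in", "is", "it"}

def naive_stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer (e.g., Porter).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def summarize_posts(posts, top_n=20):
    tokens = []
    for post in posts:
        # Step 3: normalize -- lowercase, strip punctuation and numbers.
        text = re.sub(r"[^a-z\s]", "", post.lower())
        # Step 2: white-space tokenization.
        for tok in text.split():
            # Step 4: drop stop words.
            if tok in STOP_WORDS:
                continue
            # Step 5: stem.
            tokens.append(naive_stem(tok))
    # Step 6: sort by descending frequency.
    return Counter(tokens).most_common(top_n)

posts = ["I love Fridays!", "Walking to work on a Friday", "I walked 10,000 steps"]
print(summarize_posts(posts, top_n=5))
```

With real data, each `most_common` entry is a candidate topic word, exactly as in the frequency table we report for Fitbit.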

We built all of these steps into the SMM's natural language processing function within SMM's SMILE (Social Media Intelligence and Learning Environment) tool. SMILE's natural language processing function is a graphical interface presentation of our NLTK Python code, which was created for researchers who are not experts at coding.2

We ran this text preprocessed summarization on Fitbit's Twitter timeline snapshot taken in October 2017 using the SMM's SMILE tool (Yun et al. 2019). We chose Fitbit's Twitter timeline because Yun (2018) provided human-coded topic analyses of six brands' Twitter timelines (Fitbit, Wyndham Hotels, Royal Caribbean, American Heart Association, National Rifle Association, and World Wildlife Foundation), as well as collecting all of the brands' postings using the Twitter public application programming interface (API) from September 6 through 20, 2017. For the sake of consistency and ease of comparison across so many different methods, we chose to focus on only one of the brands' Twitter timelines: Fitbit. Within Yun's (2018) Fitbit data set, Fitbit's Twitter timeline data included 3,200 Fitbit tweets. Twitter's public API limits timeline downloads to the most recent 3,200 tweets. An example of the top topics for Fitbit's social media posts using text preprocessed summarization is found in Table 1.

Text preprocessed summarization is usually a great starting point for understanding topics of conversation within social media posts because both the computational complexity and the computing resources required are low. With this said, one of the most apparent limitations of text preprocessed summarization in discovering topics of conversation within social media is that phrases will sometimes be unnaturally separated by the tokenization process (e.g., "united" and "states" would be separated). Therefore, we turn to the next method of discovering topics within social media posts: phrase mining.

Phrase Mining

The primary limitation of text preprocessed summarization is that it relies on a bag-of-words assumption, which is simply that each individual word is handled as isolated from the other words, and multiword phrases (also referred to as n-grams) are separated. For example, the phrase "United States" would be separated out as "united" and "states" in our previous approach. One way to address this issue is to predefine the size of the tokens prior to tokenization, thus choosing to make all the tokens 2-grams (two-word tokens) or 3-grams (three-word tokens), for example, but this would clearly cause problems because many words are supposed to be single words. Another problem with the bag-of-words approach is that phrases that include stop words and punctuation may be incredibly important for understanding the overall conversation.
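To make the fixed-size token idea concrete, here is a small hypothetical helper (not part of any cited toolkit) that enumerates every contiguous n-gram in a token list. Forcing 2-grams does capture "united states," but it also produces junk pairs, which is exactly the problem described above.

```python
def ngrams(tokens, n):
    """Return every contiguous n-gram (as a joined string) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the united states congress".split()
# Yields useful pairs ("united states") alongside noise ("the united").
print(ngrams(tokens, 2))
```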

Table 1. Top 20 most common words of Fitbit's Twitter posts.

Topic of Conversation    Number of Occurrences
happystep                462
step                     418
fitbit                   347
goal                     221
fitbitfriend             209
hear                     207
great                    201
fit                      193
job                      169
awesom                   160
congrat                  159
make                     148
tip                      137
day                      136
tracker                  123
good                     119
workout                  119
share                    115
work                     112
love                     108

Brown et al. (1992) introduced a concept they called "sticky pairs," in which they used a statistical formula to calculate the probability of whether two words are most likely supposed to be paired together as a phrase or handled separately as two distinct words. In one analysis, they used a 59,537,595-word sample of text from the Canadian Parliament and found that the words "humpty" and "dumpty" occur together as "humpty dumpty" 6,000,000 times more frequently than one would expect from the individual frequencies of "humpty" and "dumpty." Although this was a major step forward, this type of statistical word comparison is quite computationally expensive, as it requires parsing through all 59,537,595 words and calculating the probabilities of each word occurring next to another word, and it is also limited by focusing solely on 2-grams. This also does not address phrases that include stop words and punctuation that are important for understanding the conversation within text.
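The sticky-pair intuition can be approximated with pointwise mutual information (PMI), which compares how often a pair actually occurs with how often it would occur if the two words were independent. This is a simplified sketch of that family of association measures, not Brown et al.'s exact formula:

```python
import math
from collections import Counter

def sticky_pairs(tokens, min_count=2):
    """Rank adjacent word pairs by PMI: log2(P(w1, w2) / (P(w1) * P(w2))).
    High scores suggest the pair behaves like one phrase."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable probability estimates
        pmi = math.log2((count / (n - 1)) /
                        ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

tokens = "humpty dumpty sat on a wall humpty dumpty had a great fall".split()
print(sticky_pairs(tokens))
```

On this toy corpus, "humpty dumpty" is the only recurring pair and receives a strongly positive PMI, mirroring the Canadian Parliament example above.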

More advanced statistical techniques in detecting phrases have since been published that handle varying lengths of n-grams (e.g., Deane 2005) and even include stop words (e.g., Parameswaran, Garcia-Molina, and Rajaraman 2010), but, as mentioned by Liu et al. (2015), many of these purely statistical techniques still suffer from other limitations to detecting quality phrases. For example, what if the word just was part of the larger phrase "just do it"? Many phrase-mining techniques would struggle with differentiating these types of situations. Recently, Shang et al. (2018) built on previous phrase-mining work focusing on the popularity, informativeness, and independence of phrases. Their goal was to create an automated phrase-mining method that was "domain-independent, with minimal human effort or reliance on linguistic analyzers" (Shang et al. 2018, p. 1825). The result of their work is a framework called AutoPhrase (with associated code found at https://github.com/shangjingbo1226/AutoPhrase). At the core of AutoPhrase is the use of existing databases of human-created quality phrases, such as Wikipedia, to refine phrase detection. For example, AutoPhrase might detect two distinct phrases within a body of text: "Barack Obama" and "this is." Wikipedia has an entry for "Barack Obama" but not for "this is." Thus, the phrase "Barack Obama" would be weighted higher in the results as compared to "this is." AutoPhrase also incorporates part-of-speech (POS) tagging (i.e., separating text into various parts of speech, such as nouns and verbs) to contribute to the weighting of results. An example of the top topics for Fitbit's social media posts using phrase mining is found in Table 2.

Phrase mining provides a more complete picture of topics being discussed within social media posts as compared to text preprocessed summarization, but many of the phrase-mining algorithms require more advanced levels of computer programming knowledge. We recently incorporated AutoPhrase into SMM, thus allowing for programming-free access to the framework.3 As far as we know, phrase mining has not been applied to advertising studies or practice as of yet, but we believe this is a promising new technique to consider due to the probabilistic determination of phrases and the importance of understanding phrases, such as taglines, for advertising research.

Phrase mining is an elegant and powerful way to analyze topics within social media posts, but one of its major limitations is that while it can collect phrases, it cannot assess larger topics of conversation and themes that are not directly written in the posts. For example, if the phrases "united states," "congress," and "supreme court" were found within social media posts, the higher-level topic that is being discussed is most likely "U.S. government." This is a scenario that topic modeling is better suited to address.

Table 2. Top 20 AutoPhrase phrase-mining topics of Fitbit's Twitter posts.

Topic of Conversation        AutoPhrase Confidence Score
Red carpet                   0.9731376
Enrollment program           0.96542927
Personal trainer             0.96466617
Peanut butter                0.96342927
[Unsupported characters]a    0.96067927
Ultramarathon man            0.95777018
Case number                  0.95597927
Stay tuned                   0.9545134
Water resistant              0.9537634
Losing weight                0.95368565
Woody scal                   0.95216303
Bay area                     0.95116938
Corporate wellness           0.94877018
Slow cooker                  0.9487499
Weight gain                  0.94727732
Continuous heart rate        0.94268006
Weight loss                  0.94174546
Wireless headphones          0.94141542
Alta hr                      0.94097494
Yoga poses                   0.94007653

a AutoPhrase parses English words only; thus, characters such as emojis and other nonsupported Unicode characters do not report correctly.

Topic Modeling

Topic modeling is a method to summarize a large corpus of documents to provide a quick summary of what the documents are about (Blei, Ng, and Jordan 2003). The most common form of topic modeling is LDA topic modeling, which assumes that there are a set number of latent topics across a collection of documents, and a probabilistic equation is used to allocate words to each latent topic. This is one of the major differences between topic modeling versus text preprocessed summarization and phrase mining; topic modeling does not actually output topics but rather outputs words from the documents that are related to one another (e.g., instead of outputting a topic such as "Burger King," a topic modeling exercise would output a grouping of words such as "whopper, flame-broiled, nuggets"). Topic modeling simply groups words together that probabilistically belong together, and the researcher needs to review the word clusters to determine what the underlying topic is. This is essentially a coding exercise of determining topics, which is not a trivial task, and proper coding of topics should be based on theoretical knowledge and context-specific expertise (Humphreys and Wang 2018; Saldaña 2009). Some recent research using topic modeling in the context of marketing and business analytics includes Liu, Burns, and Hou (2017) and Liu (2019).

Topic modeling assumes there are many documents within a corpus and that each document has many words. The problem with applying topic modeling to social media posts is the question of what to consider a document. If you decide to consider each individual post a document, then the data within each post will likely be insufficient for what topic modeling was originally conceptualized to do. More recent topic modeling methods created specifically for Twitter have attempted to work around this limitation through various strategies, such as considering each conversation string between Twitter users as a separate document (Alvarez-Melis and Saveski 2016), considering all the tweets that mention the same hashtag a document (Mehrotra et al. 2013), or considering all the tweets from the same author a document (Hong and Davison 2010).
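The pooling strategies above amount to a small aggregation step before modeling. The helper below is a hypothetical sketch (not code from the cited papers) that pools tweets by any shared key, such as author or hashtag, so that short posts become pseudo-documents long enough for LDA:

```python
from collections import defaultdict

def pool_tweets(tweets, key):
    """Pool tweets sharing the same value for `key` (e.g., 'author') into one
    pseudo-document each, following the author-pooling strategy of
    Hong and Davison (2010)."""
    docs = defaultdict(list)
    for tweet in tweets:
        docs[tweet[key]].append(tweet["text"])
    return {k: " ".join(texts) for k, texts in docs.items()}

# Invented example tweets for illustration only.
tweets = [
    {"author": "runner1", "text": "new 10k step goal"},
    {"author": "runner1", "text": "hit my goal today"},
    {"author": "chef2", "text": "slow cooker recipe night"},
]
print(pool_tweets(tweets, "author"))
```

Swapping `key` for a hashtag field would implement the hashtag-pooling strategy (Mehrotra et al. 2013) with the same function.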

An example of the word clusters and projected topics for Fitbit using LDA topic modeling is found in Table 3. For LDA topic modeling, the number of latent topics must be predefined before running the clustering algorithm, because LDA topic modeling cannot estimate on its own how many latent topics are within a series of documents. For the topics in Table 3, we ran LDA topic modeling multiple times with different numbers of topic clusters on Fitbit's social media posts, considering all tweets from the same author as a document (Hong and Davison 2010), and found the most intuitive results with the value of five total latent topic clusters. We conducted this topic modeling using the gensim Python library (Řehůřek and Sojka 2011), and we also built our code into SMM's topic modeling function for researchers who would like to use that tool.4

One of the weaknesses of topic modeling is that much of the work is ultimately done by humans in determining what topics are indicated by various word clusters. This human-decision part of the topic modeling process is not typically documented and saved in a format that can be assessed or used by future researchers. This is where supervised machine-learned text classification can help.

Supervised Machine-Learned Text Classification

Supervised machine learning is “the search for algorithms that reason from externally supplied instances to produce general hypotheses, which then make predictions about future instances” (Kotsiantis 2007, p. 249). One of the most common tasks for machine learning is the task of classification, which is the process of applying meaningful labels to data to help make sense of that data. Supervised machine learning for classification can be applied to text, especially when the goal is to classify (or categorize) text according to a predefined goal. At a basic level, the steps involved are (1) identification of required data and preprocessing the data (using techniques mentioned in the text preprocessed summarization section); (2) determining which part of the data will be used as the training set from which the computer learns patterns (each data point must be labeled or categorized by either a human coder or an automated labeling process); (3) selecting which machine-learning algorithm will be used for the computer to learn patterns within the labeled data training set; and finally (4) allowing the algorithm(s) to predict a test set of data that has also been labeled to evaluate how well the algorithm(s) can guess the prelabeled test set (Kotsiantis 2007).
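These four steps can be compressed into a short scikit-learn sketch; the handful of labeled tweets is a fabricated stand-in for a real human-coded training set, and the choice of a naive Bayes classifier is illustrative rather than a recommendation.

```python
# Steps 1-3: vectorize the labeled posts and fit a classifier on them.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "crushed my step goal today", "great workout at the gym",
    "my tracker will not sync", "emailed support about my order",
]
train_labels = ["exercise", "exercise", "customer service", "customer service"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Step 4: predict held-out posts; in a real evaluation these would also be
# human-labeled so accuracy against the labels can be computed.
test_texts = ["hit a new step goal", "support never answered my email"]
print(model.predict(test_texts))
```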

This same process can be applied to social media posts in which humans label posts into topic categories based on theoretical knowledge and context-specific expertise (Saldaña 2009). Thus, supervised machine-learned text classification brings together the best of both worlds in leveraging human labeling of topics and pairing this with a computer's ability to find the statistical relationships between words in social media posts and topics. We do not provide an example of topics using supervised machine-learned text classification on the Fitbit social media posts in this section because this classification is dependent on an initial phase of human-labeled topic training.5 For additional reading, some recent studies that have used supervised machine learning within the realm of advertising and marketing are Vermeer et al. (2019) and Okazaki et al. (2015).

Supervised machine-learned text classification is excellent for creating topic categories that can be trained as granularly as a researcher may want to train

Table 3. LDA topic modeling topics of Fitbit's Twitter posts.

Topic of Conversation: Words in LDA Cluster
Marketing Fitbit: happystep, step, fitbitfriend, know, fitbit, thank, good, awesome, happi, come
Exercise: goal, great, happystep, step, goalday, crush, challeng, work, like, hear
Exercise encouragement: congrat, awesome, share, congratul, workout, love, fitbit, cook, free, butt
Healthy lifestyle: fit, tip, healthi, time, stay, health, help, weight, fitbit, sleep
Fitbit customer service: fitbit, hear, sorri, help, hope, thank, tracker, email, phhrmhlmt, check

6 J. T. YUN ET AL.


them, but the downside is that the process of hand-coding a set of social media posts large enough to adequately train a machine-learned model can be extremely time-consuming. Thus, we turn to our final method of detecting topics within social media posts: semantic topic tagging. This method leverages a combination of techniques borrowed from many of the previously mentioned topic detection methods.

Semantic Topic Tagging

Semantic topic tagging is “the extraction and disambiguation of entities and topics mentioned in or related to a given text” (Jovanovic et al. 2014, p. 39). At the heart of semantic topic tagging is the recognition that humans have already prelabeled large numbers of topics (called concepts within semantic tagging) in public Web repositories, such as Wikipedia. This is similar to phrase mining, which has leveraged Wikipedia entry titles to better rank the importance of existing phrases within social media posts (Liu et al. 2015), but semantic topic tagging extends the usage of Wikipedia-like repositories by also using data such as whole description words within Wikipedia entries themselves. The typical semantic topic tagging process for social media posts involves text preprocessing of the posts (e.g., “Try an Impossible Whopper!” to “try” “impossible” “whopper”; much like text preprocessed summarization), semantic similarity probabilistic calculations of the posts to determine n-grams and phrases using repositories like Wikipedia (much like phrase mining), and machine-learning techniques using the content of Wikipedia to help predict topics being discussed (as explained in supervised machine-learned text classification). Various frameworks exist for applying semantic topic tagging to text (for a comprehensive list, see Jovanovic et al. 2014), but two frameworks that have been proven to work well with microblog social media posts are TAGME (Ferragina and Scaiella 2010) and the work of Meij, Weerkamp, and De Rijke (2012). TAGME is currently presented as a pretrained topic tagger that can be accessed via API (Parameswaran, Garcia-Molina, and Rajaraman 2010).6 Examples of the topics tagged for Fitbit using semantic topic tagging are provided in Table 4.
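A sketch of handling a TAGME-style annotation response follows. The sample payload is hand-written to mirror the shape documented for the API (an "annotations" list with "spot", "title", and a "rho" confidence score), and the 0.1 cutoff echoes the range the TAGME documentation suggests; both the field names and the threshold should be verified against the current API documentation before use.

```python
import json

# Hand-written sample mirroring the documented TAGME response shape;
# a real response would come back from the API over HTTP.
sample_response = json.loads("""
{"annotations": [
    {"spot": "step goal", "title": "Physical fitness", "rho": 0.31},
    {"spot": "fitbit",    "title": "Fitbit",           "rho": 0.52},
    {"spot": "fun",       "title": "Fun",              "rho": 0.08}
]}
""")

def confident_topics(response, rho_threshold=0.1):
    """Keep Wikipedia topics whose confidence score (rho) clears a threshold."""
    return [a["title"] for a in response.get("annotations", [])
            if a.get("rho", 0.0) >= rho_threshold]

print(confident_topics(sample_response))  # the low-rho "Fun" tag is filtered out
```

Aggregating these filtered titles across all of a brand's posts yields occurrence counts like those in Table 4.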

To the best of our knowledge, semantic topic tagging has not yet been incorporated into advertising and marketing studies, but Jovanovic et al. (2014) discussed how semantic topic tagging is currently being used in contextual advertising “to enable better positioning of advertisements on webpages based on the semantics of the main content of the page” (p. 39). Semantic topic tagging should almost always be preferred for detecting topics of conversation in social media posts because it leverages many strengths of other topic discovery methodologies, but there are still important weaknesses to take into consideration. One of the biggest weaknesses of semantic topic tagging is its heavy reliance on the public repositories that were used to train the topics. For example, with TAGME, the topic tagger was trained with data from a Wikipedia snapshot taken on November 6, 2009. Wikipedia has clearly evolved and changed from 2009 to the current day, and many of the topics that are on Wikipedia now may not have existed in concept in 2009. For example, Kendall Jenner, a currently popular social media influencer, did not have a Wikipedia entry until 2010.

Computational Topic Methods versus Human-Coded Topics

Our next step was to compare each topic detection method against a human-coded baseline, as human-coded content analyses of social media data have been used successfully in previous studies (e.g., Chen et al. 2015; Kwon and Sung 2011; Lin and Peña 2011). For this study, we chose an available data set with topics already coded from a study that looked at Twitter timelines from brands and charities (Yun 2018). Yun (2018) used traditional content analysis techniques (e.g., Harwood and Garry 2003; Krippendorff 2013; Skalski, Neuendorf, and Cajigas 2017) to analyze Twitter timelines of the brands and charities of Fitbit,

Table 4. Top 20 TAGME semantic topic tagging of Fitbit's Twitter posts.

Topic of Conversation: Number of Occurrences
Fitbit: 562
Physical fitness: 105
Physical exercise: 90
E-mail: 64
Health: 57
“Woohoo” (Christina Aguilera song): 55
Motivation: 43
Fun: 43
U.S. dollar: 39
Billboard 200: 35
Steps (group): 34
“With You” (Chris Brown song): 34
The Who: 33
Deutsche Mark: 33
Calorie: 32
Heart rate: 31
Today (U.S. TV program): 31
Recipe: 30
Thanks for sharing: 26
For good: 26

JOURNAL OF INTERACTIVE ADVERTISING 7


Wyndham Hotels, Royal Caribbean, American Heart Association, National Rifle Association, and World Wildlife Foundation. He recruited two full-time employees from a large midwestern university to conduct the content analyses.

The first content analysis phase was an effort to decide which topics should be included in the set of topics that they would look for when reviewing the Twitter timelines for the brands and the causes. Yun (2018) adopted a provisional coding approach (Saldaña 2009), which begins the coding process with a “start list” of potential codes to use as a base prior to coding. His start list was taken from survey responses where participants were asked to list what topics they believed each brand and cause would talk about. After reviewing the suggested topics across all the brands and causes, it was determined that there was consistency in responses for the following topics: business travel, charitable giving, climate change, cruises, customer service, diet, environmentalism, exercise, fashion, firearms, gun politics, gun violence, health, heart, hotels, hunting, marketing, medicine, nature, oceans, religion, sports, technology, vacations, weather, and wildlife. He then presented each coder with the Twitter timelines from the three brands (Fitbit, Royal Caribbean, and Wyndham Hotels) and the three charities (American Heart Association, National Rifle Association, and World Wildlife Foundation) separately as HTML web pages. Upon discussion, the coders suggested that 100 to 150 tweets were as much as they could process visually and cognitively when attempting to look for topics being discussed, so 150 tweets from each brand and each cause were chosen via a random selection script and presented as separate HTML pages. The coders reviewed each Twitter timeline and indicated on a separate spreadsheet the topics they believed were being discussed from the previously constructed start list of topics. They also were given the opportunity to write in suggestions of other topics being discussed that were not covered by the provided start list of topics. Upon completion of this coding task, Yun (2018) ran a reliability analysis and found substantial agreement (κ = .70) between the two coders according to Viera and Garrett (2005). In addition, after discussion of potential topics that were not part of the original start list, the coders came to a joint conclusion that the start list was comprehensive enough without any necessary topics missing. However, they did indicate that two topics from the start list (business travel and religion) did not seem to be discussed across the three brands and the three causes. Table 5 contains an example of the topics identified by the results from the two human coders. Table 6 presents the results from each method (excluding the general supervised

Table 5. Human content analysis of Fitbit's Twitter posts.

Content of Posts
Charitable giving
Customer service
Diet
Exercise
Fashion
Health
Marketing
Technology

Table 6. Results comparison for one brand's (Fitbit) topics as determined by each method.

Human coded: Charitable giving; Customer service; Diet; Exercise; Fashion; Health; Marketing; Technology

Text preprocessed summarization: happystep; step; fitbit; goal; fitbitfriend; hear; great; fit; job; awesom; congrat; make; tip; day; tracker; good; workout; share; work; love

Phrase mining: red carpet; enrollment program; personal trainer; peanut butter; [Unsupported characters]; ultramarathon man; case number; stay tuned; water resistant; losing weight; woody scal; bay area; corporate wellness; slow cooker; weight gain; continuous heart rate; weight loss; wireless headphones; alta hr; yoga poses

Topic modeling: Marketing Fitbit; Exercise; Exercise encouragement; Healthy lifestyle; Fitbit customer service

Semantic topic tagging: Fitbit; Physical fitness; Physical exercise; Email; Health; Woohoo (Christina Aguilera song); Motivation; Fun; United States dollar; Billboard 200; Steps (group); With You (Chris Brown song); The Who; Deutsche Mark; Calorie; Heart rate; Today (U.S. TV program); Recipe; Thanks for Sharing; For Good



Table 7. Social media topic detection decision matrix.

Text preprocessed summarization
How does this method work? This method breaks down each social media post to stemmed words, removes stop words, and ranks by word frequency across all posts.
When is this method appropriate? It is almost always a proper starting point to quickly view the text, but not appropriate to be used as the sole means for detecting topics.
How can I execute this method? Various Python and R packages/code are available (e.g., Benoit et al. 2018; Loper and Bird 2002); SMM incorporates this method without the need for coding.
Advantages: It is very simple to run via available packages and for the most part does not take many computational resources.
Disadvantages: Due to numerous limitations such as mishandling of phrases, this method alone cannot be trusted as a final determinant of topics.

Phrase mining
How does this method work? This method algorithmically discerns which words should remain together as phrases and ranks by phrase certainty across all posts.
When is this method appropriate? It is appropriate when researchers believe that topics are accurately represented in social media posts as explicitly mentioned phrases.
How can I execute this method? AutoPhrase's code can be found via GitHub (Shang et al. 2018); also can be used via SMM.
Advantages: It is a major upgrade to text preprocessed summarization that still allows for fully automated topic detection.
Disadvantages: It is natively a bit complex to run via the original codebase, although it is easier to run via SMM.

Topic modeling
How does this method work? This method algorithmically aggregates words from social media posts that are related to one another, depending on the researchers to then qualitatively define topics from the word clusters.
When is this method appropriate? It is appropriate when topics are not explicitly mentioned in the posts, and computational help to aggregate topically related words is desired before qualitatively assigning topics by human coding.
How can I execute this method? Various Python and R packages/code are available (e.g., Mabey 2015; Řehůřek and Sojka 2011); also can be used via SMM.
Advantages: It is a nice balance between human content analysis and automated help in reducing words to analyze when coding topics.
Disadvantages: The quality of topics is dependent on the skill level of the human coder(s).

Supervised machine-learned text classification
How does this method work? This method takes social media posts that have been human coded into topics and uses them to train a mathematical model in which a computer can then classify future posts.
When is this method appropriate? It is appropriate when domain-specific topics have been predefined and coded by humans and researchers subsequently want to detect those predefined topics from social media posts.
How can I execute this method? The most recognized package for machine learning is Python's sklearn (Pedregosa et al. 2012); also can be used via SMM.
Advantages: It allows for the highest level of fine-tuning of topics.
Disadvantages: It is extremely time- and resource-consuming to train topics correctly.

Semantic topic tagging
How does this method work? This method uses repositories like Wikipedia as a large-scale, human-coded training set of topics to algorithmically label social media posts with topics.
When is this method appropriate? It is appropriate when repositories like Wikipedia contain the appropriate topics and are updated enough to be considered a proper computational training set for discovering topics in social media posts.
How can I execute this method? TAGME is currently the most appropriate tool for short texts such as social media and has an API that is functioning (Jovanovic et al. 2014).
Advantages: It draws on the vast crowdsourced knowledge base of Wikipedia, which allows for very up-to-date topic possibilities.
Disadvantages: Wikipedia can be edited and updated by anyone; therefore, there is the possibility for erroneous categorization of topics.



machine-learned text classification) side by side for comparison.

As seen in Table 6, the only methods that included any overlap in topic phrases from the human coding were topic modeling and semantic topic tagging. Each method of topic detection has unique nuances that stem from how it determines topics from social media text. This emphasizes the importance for researchers and practitioners to understand how computational algorithms work, as using them with a misunderstanding of how they determine the output can dramatically affect research results. Thus, rather than a one-size-fits-all approach to analyzing social media text, researchers should consider their goals and the strengths and weaknesses of the various methods considering those goals.

When to Use Which Computational Methods

For quick reference, a brief comparison of method considerations is presented in Table 7. One of the first things researchers and practitioners should consider when analyzing topics of social media posts computationally is whether they believe the actual topics of conversation are explicitly mentioned in the posts themselves. If this is the case, then phrase mining is an appropriate choice. However, running a text preprocessed summarization analysis is a good first step before running a more computationally expensive phrase-mining exercise. Phrase mining on large bodies of social media posts can take substantial computational resources to run, whereas text preprocessed summarization can usually be achieved with minimal computational resources. Unfortunately, text preprocessed summarization suffers too greatly from the issue of segmented phrases to be useful as the only means of detecting topics of conversation within social media posts. However, when comparing text preprocessed summarization and phrase mining to the human-coded topics in Table 6, it is clear that the way actual people may conceive of topics is different from what may explicitly be mentioned in the social media posts.

Topic modeling forms somewhat of a hybrid between text preprocessed summarization and human coding of topics. As can be seen in Table 3, LDA topic modeling helps aggregate stemmed words that belong together, but it is up to the researcher or practitioner to determine the underlying latent topic. We assigned the LDA word clusters to the topics of marketing Fitbit, exercise, exercise encouragement, healthy lifestyle, and Fitbit customer service, but this is clearly a subjective assignment process. Therefore, having a sound qualitative coding process is important for the human-coding step of topic modeling. Our recommendation with topic modeling is that it is more useful as a preparatory step for supervised machine-learned text classification. Topic modeling helps provide a sense of what chunks of words could form certain topics that are emerging from the posts. After a proper coding scheme is developed, researchers can then code a larger number of social media posts for a hyperspecific supervised machine-learned topic classifier for the exact topics they are seeking. Semantic topic tagging then becomes just one example of a detection algorithm that is built on shared principles of supervised machine-learned text classification (in fact, some semantic topic taggers use supervised machine learning; e.g., Meij, Weerkamp, and De Rijke 2012).

Thus, it would benefit researchers and practitioners to consider running through most, if not all, computational topic detection methods to gain a greater understanding of what is being discussed within their aggregated social media posts. The purpose of the text analysis and the project's research questions should then inform the best approach to finally choose. A major consideration for researchers should be the realization that there is quite a bit of subjectivity with these computational methods, as there is a tendency to believe that computational methods are objective in nature. In addition, an important future research direction is the creation of tools and environments that can reduce the barrier of entry to these kinds of computational methods. Just as web-authoring technology went through an evolution from direct HTML-based coding to software like Dreamweaver to, most recently, websites like Squarespace and Wix, computational data science methods need to be made more accessible via efforts such as SMM to enable researchers to focus more on their research questions, as opposed to needing to focus on becoming an expert in any given algorithmic computational method.

Concluding Remarks

In this article, we discussed various ways of computationally detecting topics of conversation in social media text, namely, text preprocessed summarization, phrase mining, topic modeling, supervised machine-learned text classification, and semantic topic tagging. We also compared those methods against human coding of topics on social media brand posts and presented a matrix that researchers and practitioners can use to make decisions when they want to analyze social media posts for topics of conversation. While all methods are currently used for analyzing what is being discussed on social media, the results of each show that there are significant differences in what a researcher might conclude is being discussed based on the method used. Rather than leading to standardized or objective outcomes, we can see the role of subjectivity and the strengths and weaknesses of the various methods. Thus, researchers should start with their research question. We also referenced advertising and marketing papers that have used the various methods so that researchers can get a more in-depth illustration of each. Finally, we included the code used for each analysis reported in this article so that researchers can assess and use that code themselves. In addition, we built that code into an existing open-source, free environment for conducting social media data analysis (http://socialmediamacroscope.org) to allow computationally nonexpert advertising researchers to try the various methods out directly for their own research questions. Although this article was not meant to be an exhaustive list of topic detection methods and considerations, it should function as a strong starting point for advertising researchers and practitioners interested in detecting topics in social media posts.

Notes

1. While we provide details and code for those who would like to use it, we wanted to ensure that researchers who do not have backgrounds in computer science or coding would be able to follow along or conduct their own research without needing to gain significant additional technological expertise. Therefore, we reference a tool, the Social Media Macroscope (SMM), numerous times throughout this article. SMM is a science gateway that allows researchers without computer science backgrounds to execute open-source data science analytic methods without the need to code, and use of this gateway is free for academic and nonprofit use (Yun et al. 2019). The methods detailed in this article do require some knowledge of coding; however, SMM can be used as an alternative option for those who are not comfortable with coding. Therefore, we built our code into the SMM project, as well as providing links to our direct code throughout the article for researchers who want to apply the code apart from SMM. All code for SMM can be found at https://opensource.ncsa.illinois.edu/bitbucket/projects/SMM.

2. For researchers who would like to run the Python code themselves, all associated code for this method can be found at https://opensource.ncsa.illinois.edu/bitbucket/projects/SMM/repos/smm-analytics/browse/lambda/lambda_preprocessing_dev/preprocessing.py.

3. Researchers desiring to run our code/scripts themselves can access the Dockerized script that we used to run AutoPhrase at https://opensource.ncsa.illinois.edu/bitbucket/projects/SMM/repos/smm-analytics/browse/batch/smile_autophrase/dockerfile.

4. For researchers interested in the Python code, it can be found at https://opensource.ncsa.illinois.edu/bitbucket/projects/SMM/repos/smm-analytics/browse/batch/batch_topic_modeling/gensim_topic_modeling.py.

5. Machine-learned text classification is often conducted using Python's sklearn package (Pedregosa et al. 2012), but we have also built text classification into SMM. Our code can be found at https://opensource.ncsa.illinois.edu/bitbucket/projects/SMM/repos/smm-analytics/browse/batch.

6. We used the TAGME API to conduct our semantic topic tagging. All documentation on how to call the TAGME API can be found at https://sobigdata.d4science.org/web/tagme/tagme-help.

ORCID

Joseph T. Yun http://orcid.org/0000-0001-6875-4456
Brittany R. L. Duff http://orcid.org/0000-0002-3206-0353

References

Alvarez-Melis, D., and M. Saveski (2016), “Topic Modelingin Twitter: Aggregating Tweets by Conversations,” pre-sented at the 10th International AAAI Conference onWeb and Social Media, Cologne, Germany, May.

Benoit, K., K. Watanabe, H. Wang, P. Nulty, A. Obeng, S.M€uller and A. Matsuo (2018), “quanteda: An R Packagefor the Quantitative Analysis of Textual Data,” Journal ofOpen Source Software, 3 (30), 774.

Blei, D.M., A.Y. Ng, and M.I. Jordan (2003), “LatentDirichlet Allocation,” Journal of Machine LearningResearch, 3 (4–5), 993–1022.

Boumans, J.W., and D. Trilling (2016), “Taking Stock of theToolkit: An Overview of Relevant Automated ContentAnalysis Approaches and Techniques for DigitalJournalism Scholars,” Digital Journalism, 4 (1), 8–23.

Brown, P.F., P.V. deSouza, R.L. Mercer, V.J. Della Pietra,and J.C. Lai (1992), “Class-Based n-Gram Models ofNatural Language,” Computational Linguistics, 18 (4),467–79.

Chen, K.-J., J.-S. Lin, J.H. Choi, and J.M. Hahm (2015),“Would You Be My Friend? An Examination of GlobalMarketers’ Brand Personification Strategies in SocialMedia,” Journal of Interactive Advertising, 15 (2), 97–110.

Daniel, E.S., Jr., E.C. Crawford Jackson, and D.K.Westerman (2018), “The Influence of Social MediaInfluencers: Understanding Online Vaping Communitiesand Parasocial Interaction through the Lens of Taylor’sSix-Segment Strategy Wheel,” Journal of InteractiveAdvertising, 18 (2), 96–109.

Deane, P. (2005), “A Nonparametric Method for Extractionof Candidate Phrasal Terms,” presented at the 43rdAnnual Meeting of the Association for ComputationalLinguistics, Ann Arbor, Michigan, June.

JOURNAL OF INTERACTIVE ADVERTISING 11

Page 12: Computationally Analyzing Social Media Text for Topics: A Primer …sundaram.cs.illinois.edu/pubs/2019/2019yun_primer.pdf · 2019-12-31 · In their book on text mining, Miner et

Fan, W., and M.D. Gordon (2013), “The Power of SocialMedia Analytics,” Communications of the ACM, 57 (6),74–81.

Ferragina, P., and U. Scaiella (2010), “TAGME: On-the-FlyAnnotation of Short Text Fragments (by WikipediaEntities),” presented at the 19th ACM InternationalConference on Information and Knowledge Management,Toronto, Canada, Octobe.

Fulgoni, G.M. (2015), “How Brands Using Social MediaIgnite Marketing and Drive Growth: Measurement ofPaid Social Media Appears Solid, But Are the Metrics forOrganic Social Overstated?,” Journal of AdvertisingResearch, 55 (3), 232–36.

Ganesan, K. (2019), “All You Need to Know about TextPreprocessing for NLP and Machine Learning,”KDnuggets, April, https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html.

Gentry, J. (2015), “twitteR,” https://www.rdocumentation.org/packages/twitteR/versions/1.1.9.

Harwood, T.G., and T. Garry (2003), “An Overview ofContent Analysis,” The Marketing Review, 3 (4), 479–98.

Hochreiter, S., and J. Schmidhuber (1997), “Long Short-Term Memory,” Neural Computation, 9 (8), 1735–80.

Hong, L., and B. Davison (2010), “Empirical Study of TopicModeling in Twitter,” Proceedings of the First Workshopon Social Media Analytics, New York: ACM, 80–88.

Humphreys, A., and R.J.-H. Wang (2018), “Automated TextAnalysis for Consumer Research,” Journal of ConsumerResearch, 44 (6), 1274–1306.

Jovanovic, J., E. Bagheri, J. Cuzzola, D. Gasevic, Z. Jeremic,and R. Bashash (2014), “Automated Semantic Tagging ofTextual Content,” IT Professional, 16 (6), 38–46.

Kietzmann, J., J. Paschen, and E. Treen (2018), “ArtificialIntelligence in Advertising: How Marketers Can LeverageArtificial Intelligence along the Consumer Journey,”Journal of Advertising Research, 58 (3), 263–67.

Kotsiantis, S.B. (2007), “Supervised Machine Learning: AReview of Classification Techniques,” Informatica, 31(2007), 249–68.

Krippendorff, K. (2013), Content Analysis: An Introductionto Its Methodology, 3rd ed., Thousand Oaks, CA: Sage.

Kwon, E.S., and Y. Sung (2011), “Follow Me! GlobalMarketers’ Twitter Use,” Journal of InteractiveAdvertising, 12 (1), 4–16.

Kwon, K.H. (2019), “Public Referral, Viral Campaign, andCelebrity Participation: A Social Network Analysis of theIce Bucket Challenge on YouTube,” Journal of InteractiveAdvertising, 19 (2), 87–99.

Liu, J., J. Shang, C. Wang, X. Ren, and J. Han (2015),“Mining Quality Phrases from Massive Text Corpora,”presented at the 2015 ACM SIGMOD InternationalConference on Management of Data, Melbourne,Australia, June.

Lin, J.-S., and J. Pe~na (2011), “Are You Following Me? AContent Analysis of TV Networks’ BrandCommunication on Twitter,” Journal of InteractiveAdvertising, 12 (1), 17–29.

Liu, X. (2019), “Analyzing the Impact of User-GeneratedContent on B2B Firms’ Stock Performance: Big DataAnalysis with Machine Learning Methods,” IndustrialMarketing Management, published electronically March 8,

———, A.C. Burns, and Y. Hou (2017), “An Investigation ofBrand-Related User-Generated Content on Twitter,”Journal of Advertising, 46 (2), 236–47.

———, P.V. Singh, and K. Srinivasan (2016), “A StructuredAnalysis of Unstructured Big Data by Leveraging CloudComputing,” Marketing Science, 35 (3), 363–88.

Loper, E., and S. Bird (2002), “NLTK: The NaturalLanguage Toolkit,” in Proceedings of the ACL-02Workshop on Effective Tools and Methodologies for TeachNatural Language Processing and ComputationalLinguistics, Stroudsburg, PA: Association forComputational Linguistics, 63–70.

Mabey, B. (2015), “pyLDAvis.” https://github.com/bmabey/pyLDAvis.

Malthouse, E.C., and H. Li (2017), “Opportunities for andPitfalls of Using Big Data in Advertising Research,”Journal of Advertising, 46 (2), 227–35.

Mart�ınez, A.M., and A.C. Kak (2001), “PCA versus LDA,”IEEE Transactions on Pattern Analysis and MachineIntelligence, 23 (2), 228–33.

Mehrotra, R, S. Sanner, W. Buntine, and L. Xie (2013),“Improving LDA Topic Models for Microblogs via TweetPooling and Automatic Labeling,” in Proceedings of the36th International ACM SIGIR Conference on Researchand Development in Information Retrieval, New York:ACM, 889–92.

Meij, E., W. Weerkamp, and M. De Rijke (2012), “AddingSemantics to Microblog Posts,” presented at the FifthACM International Conference on Web Search and DataMining, Seattle, Washington, February.

Miner, G., J. Elder, IV, A. Fast, T. Hill, R. Nisbet, and D.Delen (2012), Practical Text Mining and StatisticalAnalysis for Non-Structured Text Data ApplicationsWaltham, MA: Academic Press.

Moe, W.W., O. Netzer, and D.A. Schweidel (2017), “SocialMedia Analytics,” in Handbook of Marketing DecisionModels, 2nd ed., B. Wierenga and R. Van der Lans, eds.,Cham, Switzerland: Springer, 483–504.

Murdough, C. (2009), “Social Media Measurement,” Journalof Interactive Advertising, 10 (1), 94–99.

Okazaki, S., A.M. D�ıaz-Mart�ın, M. Rozano, and H.D.Men�endez-Benito (2015), “Using Twitter to Engage withCustomers: A Data Mining Approach,” Internet Research,25 (3), 416–34.

Parameswaran, A., H. Garcia-Molina, and A. Rajaraman(2010), “Towards the Web of Concepts: ExtractingConcepts from Large Datasets,” Proceedings of the VLDBEndowment, 3 (1–2), 566–77.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B.Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,and M. Brucher, M. Perrot, and �E. Duchesnay (2012),“Scikit-Learn: Machine Learning in Python,” Journal ofMachine Learning Research, 12, 2825–30.

Rathore, A.K., A.K. Kar, and P.V. Ilavarasan (2017), “Social Media Analytics: Literature Review and Directions for Future Research,” Decision Analysis, 14, 229–49.

Řehůřek, R., and P. Sojka (2011), “Gensim—Statistical Semantics in Python.”

Roesslein, J. (2009), “Tweepy,” https://tweepy.readthedocs.io/en/latest/.

Rosenkrans, G., and K. Myers (2018), “Optimizing Location-Based Mobile Advertising Using Predictive Analytics,” Journal of Interactive Advertising, 18 (1), 43–54.

Saldaña, J. (2009), The Coding Manual for Qualitative Researchers, London: Sage.

Schomer, A. (2019), “Google Ad Revenue Growth Is Slowing As Amazon Continues Eating into Its Share,” Business Insider, May 1, https://www.businessinsider.com/google-ad-revenue-growth-slows-amazon-taking-share-2019-5.

Shang, J., J. Liu, M. Jiang, X. Ren, C.R. Voss, and J. Han (2018), “Automated Phrase Mining from Massive Text Corpora,” IEEE Transactions on Knowledge and Data Engineering, 30 (10), 1825–37.

Skalski, P.D., K.A. Neuendorf, and J.A. Cajigas (2017), “Content Analysis in the Interactive Media Age,” in The Content Analysis Guidebook, 2nd ed., K.A. Neuendorf, ed., Thousand Oaks, CA: Sage Publications, 201–42.

Smith, A., and M. Anderson (2018), “Social Media Use in 2018,” Pew Research Center, https://www.pewresearch.org/internet/2018/03/01/social-media-use-in-2018/.

Smith, M.A., B. Shneiderman, N. Milic-Frayling, E. Mendes Rodrigues, V. Barash, C. Dunne, T. Capone, A. Perer, and E. Gleave (2009), “Analyzing (Social Media) Networks with NodeXL,” in Proceedings of the Fourth International Conference on Communities and Technologies, New York: ACM, 255–64.

Soriano, J., T. Au, and D. Banks (2013), “Text Mining in Computational Advertising,” Statistical Analysis and Data Mining, 6 (4), 273–85.

Trim, C. (2013), “The Art of Tokenization,” IBM Community, January 23, https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en.

Vermeer, S.A.M., T. Araujo, S.F. Bernritter, and G. van Noort (2019), “Seeing the Wood for the Trees: How Machine Learning Can Help Firms in Identifying Relevant Electronic Word-of-Mouth in Social Media,” International Journal of Research in Marketing, 36 (3), 492–508.

Viera, A.J., and J.M. Garrett (2005), “Understanding Interobserver Agreement: The Kappa Statistic,” Family Medicine, 37 (5), 360–63.

Yun, J.T. (2018), “Analyzing the Boundaries of Balance Theory in Evaluating Cause-Related Marketing Compatibility,” doctoral dissertation, University of Illinois at Urbana–Champaign, https://www.ideals.illinois.edu/handle/2142/101522.

———, and B.R.L. Duff (2017), “Consumers As Data Profiles: What Companies See in the Social You,” in Social Media: A Reference Handbook, K.S. Burns, ed., Denver, CO: ABC-CLIO, 155–61.

———, N. Vance, C. Wang, L. Marini, J. Troy, C. Donelson, C.L. Chin, and M.D. Henderson (2019), “The Social Media Macroscope: A Science Gateway for Research Using Social Media Data,” Future Generation Computer Systems, published electronically, doi:10.1016/j.future.2019.10.029.

JOURNAL OF INTERACTIVE ADVERTISING 13