social media brand engagement as a proxy for e-commerce ...eprints.nottingham.ac.uk/55994/1/social...

8
Social Media Brand Engagement as a Proxy for E-commerce Activities: A Case Study of Sina Weibo and JD Weiqiang Lin , Pedro Saleiro , Natasa Milic-Frayling , Eugene Ch’ng § International Doctoral Innovation Center, University of Nottingham, Ningbo, China Department of Computer Science, University of Chicago, Chicago, USA School of Computer Science, University of Nottingham, Nottingham, UK § NVIDIA Joint-Lab on Mixed Reality, University of Nottingham, Ningbo, China {wyatt.lin, eugene.chng}@nottingham.edu.cn, [email protected], [email protected] Abstract—E-commerce platforms facilitate sales of products while product vendors engage in Social Media Activities (SMA) to drive E-commerce Platform Activities (EPA) of consumers, enticing them to search, browse and buy products. The frequency and timing of SMA are expected to affect levels of EPA, increasing the number of brand related queries, clickthrough, and purchase orders. This paper applies cross-sectional data analysis to explore such beliefs and demonstrates weak-to-moderate correlations between daily SMA and EPA volumes. Further correlation analysis, using 30-day rolling windows, shows a high variability in correlation of SMA-EPA pairs and calls into question the predictive potential of SMA in relation to EPA. Considering the moderate correlation of selected SMA and EPA pairs (e.g., Post-Orders), we investigate whether SMA features can predict changes in the EPA levels, instead of precise EPA daily volumes. We define such levels in terms of EPA distribution quantiles (2, 3, and 5 levels) over training data. We formulate the EPA quantile predictions as a multi-class categorization problem. The experiments with Random Forest and Logistic Regression show a varied success, performing better than random for the top quantiles of purchase orders and for the lowest quantile of search and clickthrough activities. Similar results are obtained when predicting multi-day cumulative EPA levels (1, 3, and 7 days). Our results have considerable practical implications but, most importantly, urge the common beliefs to be re-examined, seeking a stronger evidence of SMA effects on EPA. Index Terms—Social media activities, e-commerce platform activities, cross-sectional data analysis, time series, multi-class categorization, quantile level prediction. I. I NTRODUCTION E-commerce platforms enable promotion and sales of prod- ucts at large scales and resort to careful monitoring and optimization of product sales cycle in order to ensure quality services to both sellers and buyers. Thus, a substantial effort has been put into studying clickstreams and predicting product purchases [1]–[5]. Clickstream data typically includes search- ing and browsing for specific brands or products and clicks on product information for feature and price comparison. Patterns of such user activities have been successfully used to predict users’ intent to purchase once engaged with the platform [3]– [7]. Indeed, searching, browsing, and comparing products are aligned with the Purchase Decision Models (PDM) used in marketing and sales management [8] and therefore, good pre- dictors of consumer purchases. However, vendors are equally interested in creating awareness of their brands and products to drive traffic to their Web sites and e-commerce platforms. This Fig. 1: Rolling correlation between SMA on day t p and EPA on day t p+1 , with a 30-day rolling window over two years. has led to increased commitment to Social Media Monitoring (SMM) by vendors and, consequently, to the development of tools and services for marketeers to manage their Social Media Activities (SMA). The increased investment in SMM and SMA is driven by an underlying belief that social media is a key factor in business success, particularly for product promotion and sales. However, considering the global adoption of e-commerce platforms, it is important to examine that assumption and investigate to what degree vendors’ SMA affects consumer engagement with their products, i.e., the consumers’ E-commerce Platform Activities (EPA). In fact, we expect that the effects of vendors’ SMA are conflated with and dominated by advertising campaigns organized by the e-commerce platform itself and global retail events, such as Black Friday, 11-11, and June 18 in China. At the same time, EPA analyses and predictions of SMA effects are out of reach of individual vendors and marketeers as they require comprehensive data collections and analysis and sophisticated machine learning methods. Our research involves a cross-sectional data analysis of multiple time-series from SMA and EPA and a new approach to predicting EPAs, formulated as a multi-class categorization problem with classes defined by discretized time-series spec-

Upload: others

Post on 17-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Social Media Brand Engagement as a Proxy for E-commerce ...eprints.nottingham.ac.uk/55994/1/Social Media Brand... · of tools and services for marketeers to manage their Social Media

Social Media Brand Engagement as a Proxy forE-commerce Activities A Case Study of Sina

Weibo and JDWeiqiang Linlowast Pedro Saleirodagger Natasa Milic-FraylingDagger Eugene Chrsquongsect

lowastInternational Doctoral Innovation Center University of Nottingham Ningbo ChinadaggerDepartment of Computer Science University of Chicago Chicago USADaggerSchool of Computer Science University of Nottingham Nottingham UK

sectNVIDIA Joint-Lab on Mixed Reality University of Nottingham Ningbo Chinawyattlin eugenechngnottinghameducn NatasaMilic-Fraylingnottinghamacuk saleirouchicagoedu

AbstractmdashE-commerce platforms facilitate sales of productswhile product vendors engage in Social Media Activities (SMA)to drive E-commerce Platform Activities (EPA) of consumersenticing them to search browse and buy products The frequencyand timing of SMA are expected to affect levels of EPA increasingthe number of brand related queries clickthrough and purchaseorders This paper applies cross-sectional data analysis to exploresuch beliefs and demonstrates weak-to-moderate correlationsbetween daily SMA and EPA volumes Further correlationanalysis using 30-day rolling windows shows a high variabilityin correlation of SMA-EPA pairs and calls into question thepredictive potential of SMA in relation to EPA Consideringthe moderate correlation of selected SMA and EPA pairs (egPost-Orders) we investigate whether SMA features can predictchanges in the EPA levels instead of precise EPA daily volumesWe define such levels in terms of EPA distribution quantiles(2 3 and 5 levels) over training data We formulate the EPAquantile predictions as a multi-class categorization problem Theexperiments with Random Forest and Logistic Regression showa varied success performing better than random for the topquantiles of purchase orders and for the lowest quantile of searchand clickthrough activities Similar results are obtained whenpredicting multi-day cumulative EPA levels (1 3 and 7 days)Our results have considerable practical implications but mostimportantly urge the common beliefs to be re-examined seekinga stronger evidence of SMA effects on EPA

Index TermsmdashSocial media activities e-commerce platformactivities cross-sectional data analysis time series multi-classcategorization quantile level prediction

I INTRODUCTION

E-commerce platforms enable promotion and sales of prod-ucts at large scales and resort to careful monitoring andoptimization of product sales cycle in order to ensure qualityservices to both sellers and buyers Thus a substantial efforthas been put into studying clickstreams and predicting productpurchases [1]ndash[5] Clickstream data typically includes search-ing and browsing for specific brands or products and clicks onproduct information for feature and price comparison Patternsof such user activities have been successfully used to predictusersrsquo intent to purchase once engaged with the platform [3]ndash[7] Indeed searching browsing and comparing products arealigned with the Purchase Decision Models (PDM) used inmarketing and sales management [8] and therefore good pre-

dictors of consumer purchases However vendors are equallyinterested in creating awareness of their brands and products todrive traffic to their Web sites and e-commerce platforms This

Fig 1 Rolling correlation between SMA on day tp and EPAon day tp+1 with a 30-day rolling window over two years

has led to increased commitment to Social Media Monitoring(SMM) by vendors and consequently to the developmentof tools and services for marketeers to manage their SocialMedia Activities (SMA) The increased investment in SMMand SMA is driven by an underlying belief that social mediais a key factor in business success particularly for productpromotion and sales However considering the global adoptionof e-commerce platforms it is important to examine thatassumption and investigate to what degree vendorsrsquo SMAaffects consumer engagement with their products ie theconsumersrsquo E-commerce Platform Activities (EPA) In factwe expect that the effects of vendorsrsquo SMA are conflatedwith and dominated by advertising campaigns organized bythe e-commerce platform itself and global retail events suchas Black Friday 11-11 and June 18 in China At the sametime EPA analyses and predictions of SMA effects are outof reach of individual vendors and marketeers as they requirecomprehensive data collections and analysis and sophisticatedmachine learning methods

Our research involves a cross-sectional data analysis ofmultiple time-series from SMA and EPA and a new approachto predicting EPAs formulated as a multi-class categorizationproblem with classes defined by discretized time-series spec-

tra More precisely we specify distinct levels of EPA activitiesthat correspond to quantiles of the EPA data distribution overa given time period We use a data set provided by JD e-commerce platform (JDcom) that comprises search click-through and product orders for a sample of 33 vendors fromfive product categories For each vendor in the sample we col-lect social media posts reposts and commentaries publishedon the Sina Weibo (Weibocom) social media platform Thedata covers a period of two years We use time-alignment ofvendorsrsquo social media posts reposts and comments with EPAthat correspond to different decision stages in the consumersrsquopurchases search clickthrough and orders By relating thevendorsrsquo SMA and EPA we postulate

Should the data analysis reveal high correlation between avendorrsquos daily SMA and its customersrsquo daily EPA we mightbe able to create reliable predictors of EPA based on SMAfeatures

Our correlation analyses of specific SMA and EPA pairsie Pearson correlation of Post with Search Clickthroughwith Order and Repost with Comment data for different lagfactors (1-15 day shifts) point to a highly variable correlationover time as illustrated in Figure 1 Across the sample of33 vendors the correlation coefficients of SME-EPA pairsvary between 008 and 056 indicating that predicting dailyvolumes of EPA purely based on SMA would not be reliableNevertheless the question still remains

Could SMA features be used to predict changes in the levelsof EPA ie highs and lows in search clickthrough and ordersespecially if SMA has a sustained effect over several days eg3 to 7 days following a specific social media activity

In order to explore this we create predictive models forspecific levels of cumulative search clickthrough and ordersadded up over 1 3 and 7 days where levels are defined by thequantiles of the corresponding distributions over the trainingtime period We apply Random Forest and Logistic Regressionclassifiers for categories corresponding to 2 3 and 5 quantilesie (i) above the median (ii) within three quantiles (0-3333-66 66-100) and (iii) within five quantiles (0-20 20-40 40-60 60-80 and 80-100)

Our experiments show that SMA can be used to predict topquantiles of product orders and low quantiles of search andclickthrough across vendorsrsquo categories The performance forthe remaining ranges is on par with random predictions Theresults are important for optimizing the e-commerce platformoperations for peaks in EPA and for evaluating marketingcampaigns through SMA From the research perspective ourwork is the first to provide a detailed analysis of SMA-EPApairs and a method that reveals selective roles that SMA typesplay in predicting levels of specific EPA types

II RELATED WORK

Most of the existing work in understanding purchase behav-ior and user intent aims at estimating conversion rates for indi-vidual products individual customers or both [1] [3]ndash[7] [9]Computational studies of consumers behavior have adoptedad-hoc approaches and brute-force feature engineering aimingto uncover factors that explain user behavior particularly the

intent to buy [1] [2] [5] In contrast studies within businessand market research have led to conceptual frameworks thatare useful for both optimizing and interpreting experimentresults particularly the effects of features used in predictivemodels [4] [6]

Considering e-commerce platforms research has focusedon behavioral patterns product characteristics [3] [7] andfeedback in product reviews [10] and use them as featuresfor predicting the online product sale Such predictive fea-tures often originate from resources internal to the shoppingplatform [3] [11] or those collected externally such as socialmedia content [12] or query logs from online search engines[13]

With regards to social media past studies explored howcommercial intent expressed in tweets correlates with ex-ternal events [14] [15] Homophily was found to have asignificant influence on the purchase intent ie users are morelikely to purchase products similar to those purchased by theirfriends due to word-of-mouth effects [16] This phenomenonis important for e-commerce and has been studied in detailover the past decade [17] [18] It has also been recognizedthat other external factors such as geographic proximity mayexplain similarity in purchase patterns among friends on socialmedial [19] [20] Furthermore text analysis has been appliedto detect purchase intent from social media content [21] andFacebook profiles [22] and provide recommendations basedon micro-blogging activities [23]

We complement this work by considering social mediamonitoring practices and macro-level analyses that do notinvolve individual user profiles and content analysis but focuson the levels and timing of brand-specific social engagementsie volumes of vendorsrsquo post and the followers commentariesand repost Understanding temporal patterns of user activitieshas generated a wide interest over the past decade includingthe identification of broad classes of temporal patterns basedon activity peaks [24]ndash[27] Usually those classes are definedbased on specific volumes ranges and duration of activitiesbefore and after the peak Crane and Sornette [24] defineendogenous and exogenous origins of peaks based on triggersgenerated by internal aspects of the social network They foundthat popularity is mostly driven by exogenous factors insteadof endemic spreading Yang and Leskovec [27] propose a newmeasure of time series similarity and clustering

In our research we apply cross-sectional analysis of datafrom two distinct platforms rather than studying specific as-pects of commercial intent within social media or e-commerceplatforms themselves With the aim to explore the two sets oftemporal data we apply correlation analysis and formulatediscretized prediction problems that can leverage SMA fea-tures to predict certain levels of e-commerce activities despitea highly volatile correlation of the corresponding data series

III PRELIMINARIES AND DATA ANALYSIS

E-commerce platforms are instrumented to capture infor-mation about shopping activities and gather detailed statisticsabout consumersrsquo interactions with facilities that support thepurchase workflow Analyzing such data is instrumental for

optimizing the platform operations Our research is conductedin collaboration with JD1 with access to anonymized dataabout product sales of vendors in different product categoriesFurthermore we include information about vendorsrsquo socialengagements on Sina Weibo2 platform that hosts Web sitesof many JD vendors The vendors share post and engage thecommunity through comment and discussions We align SinaWeibo data and JD logs to understand how the vendorsrsquo SocialMedia Activities (SMA) on Weibo relate to the E-commercePlatform Activities (EPA) of JD users focusing on patternsof interaction within both services We perform a macro-level analysis considering time-series of actions analyses ofindividual usersrsquo behavior or content shared in social mediaare not part of this study

TABLE I Total social media activities of the sampledvendors within each product category for the two year period

(Jan 2016-Dec 2017)

Category Number of Vendors Post Comment RepostPhoneampEl 4 37312 5143200 2629674Sports 5 6927 295952 230733Food 9 17666 1072441 496664Clothes 6 9456 227258 155334Home 9 53857 1902164 681071

Fig 2 Total Post Comment and Repost statistics normalizedby the corresponding maxima over the two year period (Jan

2016-Dec 2017)A Data

Our EPA data set comprises JD records of (1) Search forbrands (2) Clickthrough and (3) Orders Search queries areincluded only if explicitly mention a vendorrsquos name The datawas collected for a sample of 50 most popular JD vendorsacross 5 product categories PhoneampElectronics Sports FoodClothes and Home on JDrsquos platform The vendorrsquos popularityranking is based on the JDrsquos sales performance metrics thatis treated as business confidential Among 50 companies weidentified 33 that have had a Weibo account for at least two

1JDcom is one of the largest e-commerce platforms in China2Weibocom is the largest social media platform in China with social media

features that combine those of Facebook (wwwfacebookcom) and Twitter(wwwtwittercom)

Fig 3 Total Search Clickthrough and Order volumes ofindividual vendors normalized by the daily maxima over the

two year period (Jan 2016-Dec 2017)

years from Jan 2016 until Dec 2017 and published at least 10posts over that period The final sample comprises 4 vendorsin PhoneampElectronics (PE) 5 in Sports 9 in Foods 6 inClothes and 9 in Home For each vendor we collected (1)Post (2) Repost and (3) Comment statistics

The EPA and SMA data is segmented per calendar dateTable I shows the SMA distribution per vendor category andFigure 2 plots the total SMA volumes of Post Repost andComment activities for individual vendors on a logarithmicscale As expected the Repost and Comment statistics aretwo orders of magnitude higher than Post illustrating astrong social media effect in response to the vendorsrsquo postingactivities Considering the SMA statistics across five categories(Figure 2) the vendors in the PhoneampElectronics categoryappear to gain most traction through repost and commentingactivities Sports category ranks lowest on posts while theClothes category is the lowest when considering the sum ofall three SMA statistics

Figure 3 presents the total Search Comment and Orderstatistics for individual vendors normalized by the maximumdaily volume over the period of two years It shows that onlinepurchases incur Search and Clickthrough volumes with 1-2orders of magnitude higher than Order volumes as they playimportant parts in the purchase decision process

B Correlation Analysis

We calculate the pairwise Pearson Correlation betweenSMA (Post Repost Comment) and EPA (Search Click-through Order) for the two year time-series corresponding tothe individual vendors Table II shows the highest correlationcoefficient for each SMA-EPA pair among all 33 vendorsIt confirms low-to-medium correlations of SMA-EPA acrossvendors with Post-Search having the highest maximum coef-ficient of 056 The maximum of 011 for Repost-Clickthroughand 008 for Repost-Orders indicate low correlations betweenthese SMA and EPA types across all the vendors

Heat maps in Figure 4 present the pairwise SMA-EPA cor-relation coefficients for each of the 33 vendors across the fulltwo year period by considering the next-day EPA statistics for

(a) Pearson correlations between SMA onday tp and Search on tp+1

(b) Pearson correlations between SMA onday tp and Clickthrough on tp+1

(c) Pearson correlations between SMA onday tp and Order on tp+1

Fig 4 Pearson Correlation between social media activities (Post Repost and Comment) and specific e-commerce platformactivities (Search Clickthrough and Order) for 33 vendors using their two year time series Heat maps show low to medium

correlations between SMA and EPA pairs

TABLE II The highest correlation coefficients acrossSMA-EPA pairs for the sample of 33 vendors

Post Comment RepostClickthrough 039 023 011Search 056 020 023Order 025 021 008

a given SMA day statistics Comparing the relative importanceof SMA types it appears that the Post statistics are morehighly correlated with Search and Clickthrough driving thebrand awareness and product interest Considering Commentstatistics across vendors they are more highly correlatedwith Clickthrough and Orders than with Search suggestingthat comments on posts and reposts may help with purchasedecisions Our analysis is the first to offer empirical evidencethat different types of social media engagements relate todifferent aspects of online shopping activities

We further consider SMA-EPA correlations for differenttime frames and time lags

1) Rolling Correlation Analysis We calculate the PearsonCorrelation Coefficient for SME-EPA pairs using the rolling30-day window over the period of two years (instead of thesingle two-year time-series) and show that the correlationcoefficients for daily statistics varies over time This is illus-trated in Figure 1 (Section I) showing time variations of thecorrelation coefficients for the vendor with the highest volumeof comments Without stable correlations it would be hard forexample to use Comment features to predict Orders

2) SMA-EPA Lag Analysis In our scenario SMA andEPA occur on different platforms ie Sina Weibo and JDrespectively Thus even if SMA has an effect on EPA onecan expect delays depending on the speed of social mediapropagation For that reason we investigate rolling SMA-EPAcorrelations with different day lags allowing for 1-15 daydelay By repeating the 30-day rolling window calculationswith 1-15 day shift in time series we observe a weekly patternin the rolling correlation with a spike every 7 days Howevereven for the vendor with the highest volume of Comment

activities the absolute correlation values are always below05 Generally the volumes of Post activities have the highestcorrelation with the volumes of Search or Clickthrough whileComment data has the highest correlation with Order

C Problem Formulation

Concluding from the presented analysis the task of pre-dicting daily volumes of e-commerce activities from levels ofsocial media activities is not well supported by the correla-tion analysis of daily statistics However in many practicalscenarios it is sufficient to predict a trend ie a change inthe volume range anticipating highs and lows that may affectproduct supplies and platform logistics Furthermore insteadof daily activities it is sufficient to predict purchase outcomesover a given period of time eg a cumulative sales for 3-7ahead Thus in Section IV we formulate the EPA predictionproblem by considering (a) discretized distributions of EPAvolumes for each vendor into 2 3 and 5 quantiles and (b)quantile predictions of EPA totals for the next day next 3days and next 7 days

IV PREDICTING E-COMMERCE ACTIVITIES

Let us now consider a set of vendors represented by theirbrands B = b1 b2 bi Without a loss of generalitywe can assume that each vendor is represented by a singlebrand that has a specific daily stream of social media signalsSt = s1 s2 sj consisting of official Post activitiesby the vendor and the Repost and Comment activities by theusers of the social media platform Similarly each brand hasa daily stream of e-commerce platform activities (EPA) Et =e1 e2 el corresponding to Search Clickthrough andOrder actions on a given day t Some of the social mediausers may also be customers of the e-commerce platform butwe do not attempt to align the user activities across the twoplatforms

We consider the predictive power of the social mediasignals regarding specific e-commerce platform activities iefv(bi t St Et) a discrete volume function of a brand bi

a timestamp t a social media stream St and a type of e-commerce activity stream Et We are in fact interested inaggregated volumes of specific EPA types and consider a timeframe T = [tp tp+h] where the date tp is the day of predictionand tp+h is the prediction horizon h = 1 3 7 (1-day 3-day7-day ahead) Therefore we aim to learn a proxy function fvthat at tp defines a cumulative function

FV (bi T SE) =

t=tp+h983131

t=tp

fv(bi t St Et)

integrating ie summing up fv over T

A Multi-class Predictions

Instead of predicting specific values of FV (bi S T Et) wepredict the distribution of FV in terms of quantile levels thatFV may attain on a given day or a given period of timeEvaluating the performance of quantile predictors enables usto assess whether different social media signals are predictiveof specific EPA levels We cast that as a multi-class categoriza-tion problem using supervised learning where the number ofquantiles q determines the number of classes Given a numberof quantiles q and a training set with values FV (bi S T Et)from the training time period we determine the data pointsFV (bi S T Et) within each quantile ie

Pr(FV (bi S T Et) le fk) = kq forallk isin [1 q minus 1]

and the corresponding ranges of FV values The mini-mummaximum values determine the thresholds for assigninga class label to an instance of FV

Training of the multi-class predictor is based on partitioningthe training set into distribution quantiles with correspondingclass labels and value ranges We use the quantile rangesto assign class labels to FV values of the test set If thetest distribution is broader than the one of the training setFV values below the minimum or above the maximum areassigned to the lowesthighest quantile respectively

Given a time of prediction tp we extract features fromthe social media stream S up to tp and predict FV for theprediction horizon tp+h We consider activities on a daily basisand use tp as 235959 on each day

B Feature Construction

Our feature set is based on the three types of SMA signalsPost Repost and Comment activities For each signal type wegenerate features by calculating statistics from SMA streamsfor the time spans of a pre-specified length In particular weconsider K-day periods with Kisin1 3 5 7 Table III showsthe statistics that we calculate for each signal type ie PostRepost Comment over the specific stream length K TheTheil-Sen estimator [28] is a non-parametric trend detectorthat corresponds to the median over all possible combinationsof slopes over K daily measures of the SMA activity In totalwe construct 22 features for each of the 3 SMA types thetotal of 66 features that characterize a vendorrsquos SMA

TABLE III Description of statistics used as features for eachSMA type over K days before tp For K = 1 we consider

only Featsum

Feature DescriptionFeatsum Sum of activity volume over K-day periodFeatmean Mean activity volume over K-day periodFeatmax Maximum activity volume over K-day periodFeatmin Maximum activity volume over K-day periodFeatvar Variance of activity volume over K-day periodFeatstdev Standard deviation of activity volume over K-day pe-

riodFeatprev Activity volume on previous dayFeattheil Theil-Sen estimator

V EXPERIMENTS

We use Logistic Regression and Random Forest [29]ndash[31] tolearn multi-class classification models for 2 3 and 5 quantiles(2-q 3-q 5-q) and vary the prediction horizon by modelingnext day (1-day) 3-day and 7-day cumulative volumes ofSearch Clickthrough and Orders We evaluate multi-class pre-diction results by calculating standard precision recall and F1statistics However we focus our discussion on the precisionsince the aim is to assess the effectiveness of SMA featuresin predicting volume levels of specific EPA types Thusidentifying true-positive instances within a specific quantileis given a priority over avoiding false-negative instances Infact correct predictions of high (top 20-30) and low(bottom 20-30) quantiles are of particular interest sincetheir detection can improve SMA campaigns and optimize e-commerce operations for high levels of activities that on somedays such as global shopping events can increase 100-fold

Fig 5 Sliding window of a 12-month training and 1-monthtest data with a shift of one calendar month to cover the test

period from Jan 2017 until Dec 2017

1) Temporal Cross-validation All our experiments areperformed using 12-fold time-series cross-validation as shownin Figure 5 with data sets specific to each vendor We trainand test our models using a two year JD data set (SectionIII-A) Our starting training set covers the 12 month periodfrom 1 January 2016 to 31 December 2016 The test set isthe following month In each fold we slide the 12-monthtraining and 1-month test period by one calendar month Thusfor each experiment we use one year of historical data andpredict EPA in every calendar month in 2017 For each vendorwe report the average precision statistics across 12-folds Wecollate and average precision statistics across 33 vendors andacross vendor categories

2) Activity Predictions For a given quantile scale (2-q 3-q5-q) we predict volumes of activities for the next day tp+1 (1-

(a) Precision statistics for 3-q next-day predictions for Orders (b) Precision statistics for 5-q next-day predictions for Orders

Fig 6 Precision of Random Forest predictors for 3-q and 5-q categories for the next-day Orders across 33 vendors The topquantile precisions are higher than random predictions (33 for 3-q and 20 for 5-q)

TABLE IV Precision statistics for Random Forest 5-q classifiers for the next-day volumes of each EPA type OrderClickthrough and Search Results are aggregated per five categories

Vendor 0-20 20-40 40-60 60-80 80-100AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN

Order

PampE 0159 0338 0013 0136 0218 0054 0198 0292 0070 0298 0570 0149 0341 0393 0271Sports 0129 0340 0000 0161 0370 0045 0154 0263 0021 0247 0425 0149 0456 0653 0154Food 0070 0231 0000 0090 0302 0000 0144 0271 0030 0313 0462 0181 0410 0708 0213Clothes 0199 0571 0000 0129 0217 0011 0186 0385 0065 0225 0464 0018 0308 0528 0076Home 0167 0596 0000 0157 0317 0000 0263 0517 0092 0222 0333 0091 0312 0481 0079

Clickthrough

PampE 0371 0548 0192 0200 0311 0099 0147 0241 0025 0159 0353 0000 0265 0565 0055Sports 0462 0631 0275 0264 0393 0094 0144 0328 0038 0135 0348 0024 0148 0286 0037Food 0351 0860 0083 0223 0488 0056 0177 0345 0030 0136 0290 0000 0162 0385 0036Clothes 0312 0571 0013 0165 0260 0030 0199 0426 0012 0145 0308 0000 0192 0500 0023Home 0324 0645 0100 0225 0347 0116 0188 0385 0025 0149 0259 0039 0174 0365 0000

Search

PampE 0521 0679 0336 0231 0353 0065 0109 0293 0029 0077 0176 0000 0255 0481 0070Sports 0483 0763 0276 0248 0350 0103 0130 0222 0000 0081 0173 0000 0126 0250 0066Food 0308 0784 0089 0193 0370 0027 0183 0324 0027 0175 0373 0052 0185 0317 0048Clothes 0329 0628 0194 0251 0353 0114 0232 0313 0131 0127 0238 0020 0119 0288 0043Home 0351 0573 0179 0232 0323 0041 0200 0322 0083 0200 0311 0098 0194 0316 0086

day) and the multi-day cumulative activities over 3-day and7-day periods For each prediction type and each individualvendor we use the corresponding quantile scale determinedover the training data Our experiments thus involve 891predictions for each of 33 vendors 3 EPA types (SearchClickthrough Orders) with three quantile scales (2-q 3-qand 5-q) and 3 time periods (1-day 3-day 7-day) For thisdiscussion we select experiments that shed light on

bull How successful the predictors are in identifying quantilesfor individual EPA types across the vendor sample

bull How well the predictors perform for the cumulative 3-dayand 7-day activities across the vendor sample

bull Which features significantly contribute to the perfor-mance of predictors for the individual EPA type

A Next-day EPA Predictions for 3 and 5 QuantilesExperiments with the next day predictions of EPA for 3-q

and 5-q show that the Random Forest (RF) predictors performhigher than random for Orders in the top quantiles (top 33and 20 respectively) Figure 6 presents RF precision forOrder quantiles for individual vendors Table IV summarizesRF precision statistics for all three EPA types and the fivevendor categories We highlight the RF results that are betterthan random predictions (gt 020 for 5-q) We see thatin addition to the top quantile predictions for Orders theprecision statistics are better than random for Search andClickthrough in the lowest quantile (bottom 33 for 3-q and

bottom 20 for 5-q) All other quantile predictions for OrderClickthrough and Search are on par or lower than random

We conducted the same experiments with Logistic Regres-sion (LR) and found that RF outperforms LR predictors Infact for our sample of product vendors the t-test performedfor LR and RF predictions yield p lt 005 for all the quantilelabels (not just the top quantiles)

B Predictions of Multi-day Cumulative EPA

Considering a selective success of SMA based predictorsfor quantiles of the next-day EPA we train predictors forcumulative EPA ie EPA volumes over 3 and 7 days Weexpect that multi-day cumulative statistics are less volatileand therefore the quantile levels may be more stable andpredictable over time

Our experiments show that RF predictors for 3-day and7-day cumulative EPA are consistently performing at thesimilar order of magnitude for a given quantile level That isillustrated in Table V showing the precision statistics of RFpredictors for 1-day 3-day and 7-day cumulative volumes ofOrder Clickthrough and Search for 3 quantiles (3-q) Similarobservations are made for 5 quantiles and for the LR classifier

We conclude that our high performing predictors for cu-mulative multi-day EPA volumes can be flexibly applied todifferent scenarios that benefit from observing cumulativeEPA across multiple days This includes the precision of topquantile volumes for Orders Figure 7 shows 3-q predictions

TABLE V Three quantile prediction results for five categories of products across 33 vendors based on Random Forestreporting average precision by categories of 1-day (1D) 3-day (3D) and 7-day (7D) cumulative Orders Clickthrough and

Search for each quantile

Vendor 0-33 33-66 66-1001D 3D 7D 1D 3D 7D 1D 3D 7D

Order

PE 0232 0225 0237 0327 0329 0303 0522 0536 0551Sports 0241 0212 0172 0237 0240 0298 0584 0598 0607Food 0141 0139 0133 0288 0230 0251 0663 0656 0675Clothes 0287 0286 0280 0297 0272 0310 0397 0431 0478Home 0261 0266 0244 0375 0383 0351 0453 0467 0464AVG 0226 0222 0209 0310 0293 0302 0528 0540 0556

Clickthrough

PE 0488 0480 0471 0269 0303 0232 0385 0362 0372Sports 0637 0620 0623 0226 0263 0214 0232 0223 0266Food 0464 0471 0476 0282 0285 0286 0263 0266 0285Clothes 0418 0399 0423 0301 0311 0293 0331 0328 0329Home 0475 0487 0484 0327 0307 0289 0285 0278 0256AVG 0488 0486 0490 0288 0295 0271 0291 0286 0293

Search

PE 0597 0621 0655 0182 0194 0150 0277 0259 0289Sports 0621 0613 0616 0266 0248 0263 0166 0157 0152Food 0442 0413 0440 0316 0319 0292 0299 0291 0276Clothes 0455 0479 0476 0375 0370 0355 0225 0205 0172Home 0453 0474 0524 0360 0313 0298 0330 0327 0326AVG 0493 0497 0522 0315 0301 0283 0271 0261 0254

(a) Precision of bottom quantile predictionsfor Orders

(b) Precision of middle quantile predictionsfor Orders

(c) Precision of top quantile prediction forOrders

Fig 7 Precision of Random Forest predictions into three quantiles for 1-day 3-day and 7-day cumulative Orders

of 1-day 3-day and 7-day cumulative Orders for individualvendors

C Feature Significance

Both Random Forest and Logistic Regression allow us toassess the significance of individual feature types in terms oftheir contributions to the prediction decision For illustrationwe present the analysis of features for the RF classifier trainedfor 3-q predictions of the next-day EPA volumes For eachclassifier we calculate the average relative rank of a featureTable VI shows the top 10 contributing features of 1-day EPApredictors for Orders Clickthrough and Search respectivelyacross all the vendors and 3 quantiles

We observe that the relative feature importance is to adegree in agreement with the observations from the corre-lation analysis in Section V-2 Search levels are predictedby Comment and Repost activities considered over differentlengths of time 1 3 5 and 7 days Clickthrough activitiesseem to be aligned with SMA over 7 and 3 day periodsprimarily with consumer comments Orders however areclearly related to Comment volumes over a longer periodsie 5 and 7 days Overall we recognize the importance of

features generated from Comment activities as they contributeto the predictors of all EPA types

TABLE VI Ten top-ranked features of the 3-q RandomForest predictors for the next-day EPA

Rank Order Clickthrough Search1 7DCommentmin 7DCommentmin 7DCommentmin

2 7DCommentsum 3DCommentmean 5DRepostmin

3 7DCommentmean 3DCommentsum PreviousDComment4 7DCommentmax PreviousDComment 7DCommentstd5 5DCommentmean 3DCommentmax 3DCommentmean

6 5DCommentmax 7DCommentsum 7DRepoststd7 5DCommentsum 7DCommentmean 5DRepostsum8 7DCommentstd 7DPostsum 3DCommentsum9 7DCommentvar 3DCommentmin 7DCommentvar10 3DCommentmean 5DCommentvar 5DCommentvar

VI CONCLUDING REMARKS

This paper presents a detailed empirical study of the re-lationship between SMA of product vendors and EPA ofconsumers interested in the vendorsrsquo products The studyis the first to characterize the correlation of specific SMAengagements ie posts reposts and comments and EPA typesthat correspond to specific stages in the consumersrsquo purchasedecisions ie search for brands and products clickthrough

product information and product orders Our analyses un-covered low-to-moderate correlations between the volumes ofSMA-EPA pairs suggesting that predicting daily volumes ofEPA based on SMA volumes alone is not well supported How-ever moderate correlations of Post-Search Post-Clickthroughand Comment-Order across vendors suggest that one may beable to train predictors of EPA distributions and their changesrather that the precise daily volumes We thus introduce a newapproach of characterizing SMA-EPA relationship in terms ofEPA quantiles We formulate EPA predictions as a multi-classcategorization problem into 2 3 and 5 quantiles Our RandomForest classifiers outperform both the random predictors andthe Logistic Regression classifiers for top quantiles of Orderand bottom quantiles of Search and Clickthrough Our studyprovides unique insights into the varied correlations of SMA-EPA pairs and a mixed success of SMA-based predictorsin determining EPA levels A general view that social me-dia engagements undoubtedly drive consumer traction on e-commerce platforms is not substantiated by the Sina Weiboand JD case study and requires further analysis In our futurework we will expand explorations of this issue with a broaderset of product categories and SMA analyses that considerproperties of the content exchanged through the SMA

VII ACKNOWLEDGEMENTS

We wish to express our gratitude to JDcom for providingaccess to invaluable data and Bin Xu and Hao Dong for theirassistance We also acknowledge the financial support from theInternational Doctoral Innovation Centre Ningbo EducationBureau Ningbo Science and Technology Bureau and theUniversity of Nottingham This work was also supported bythe UK Engineering and Physical Sciences Research Council[grant number EPL0154631]

REFERENCES

[1] J Yeo S Kim E Koh S-w Hwang and N Lipka ldquoPredicting onlinepurchase conversion for retargetingrdquo in Proceedings of the 10th ACMInternational Conference on Web Search and Data Mining pp 591ndash600ACM 2017

[2] C Lo D Frankowski and J Leskovec ldquoUnderstanding behaviors thatlead to purchasing A case study of pinterestrdquo in Proceedings of the22th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining pp 531ndash540 2016

[3] W W Moe ldquoBuying searching or browsing Differentiating betweenonline shoppers using in-store navigational clickstreamrdquo Journal ofConsumer Psychology vol 13 no 1-2 pp 29ndash39 2003

[4] C Sismeiro and R E Bucklin ldquoModeling purchase behavior at an e-commerce web site A task-completion approachrdquo Journal of MarketingResearch vol 41 no 3 pp 306ndash323 2004

[5] C Park D Kim J Oh and H Yu ldquoPredicting user purchase in e-commerce by comprehensive feature engineering and decision boundaryfocused under-samplingrdquo in Proceedings of the 2015 International ACMRecommender Systems Challenge p 8 ACM 2015

[6] D Van den Poel and W Buckinx ldquoPredicting online-purchasing be-haviourrdquo European Journal of Operational Research vol 166 no 2pp 557ndash575 2005

[7] D J Bertsimas A J Mersereau and N R Patel ldquoDynamic classifica-tion of online customersrdquo in Proceedings of the 2003 SIAM InternationalConference on Data Mining pp 107ndash118 SIAM 2003

[8] K L Kotler Philip Keller Marketing Management volume 14 PrenticeHall Englewood Cliffs 2015

[9] L Xu J A Duan and A Whinston ldquoPath to purchase A mutuallyexciting point process model for online advertising and conversionrdquoManagement Science vol 60 no 6 pp 1392ndash1412 2014

[10] X Yu Y Liu X Huang and A An ldquoMining online reviews forpredicting sales performance A case study in the movie domainrdquoIEEE Transactions on Knowledge and Data Engineering vol 24 no 4pp 720ndash734 2012

[11] E Young Kim and Y-K Kim ldquoPredicting online purchase intentionsfor clothing productsrdquo European Journal of Marketing vol 38 no 7pp 883ndash897 2004

[12] D Hawkins D Mothersbaugh and R Best ldquoConsumer behaviourBuilding marketing strategy mcgraw hillrdquo 2012

[13] G Kulkarni P Kannan and W Moe ldquoUsing online search data toforecast new product salesrdquo Decision Support Systems vol 52 no 3pp 604ndash611 2012

[14] J Wang W X Zhao H Wei H Yan and X Li ldquoMining new businessopportunities Identifying trend related products by leveraging commer-cial intents from microblogsrdquo in Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Processing pp 1337ndash13472013

[15] B Hollerit M Kroll and M Strohmaier ldquoTowards linking buyers andsellers detecting commercial intent on twitterrdquo in Proceedings of the22nd International Conference on World Wide Web pp 629ndash632 ACM2013

[16] F Kooti K Lerman L M Aiello M Grbovic N Djuric and V Ra-dosavljevic ldquoPortrait of an online shopper Understanding and predictingconsumer behaviorrdquo in Proceedings of the 9th ACM InternationalConference on Web Search and Data Mining pp 205ndash214 ACM 2016

[17] J Leskovec L A Adamic and B A Huberman ldquoThe dynamics ofviral marketingrdquo ACM Transactions on The Web (TWEB) vol 1 no 1p 5 2007

[18] A Anagnostopoulos R Kumar and M Mahdian ldquoInfluence and cor-relation in social networksrdquo in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 7ndash15 ACM 2008

[19] D Crandall D Cosley D Huttenlocher J Kleinberg and S SurildquoFeedback effects between similarity and social influence in onlinecommunitiesrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining pp 160ndash168ACM 2008

[20] C R Shalizi and A C Thomas ldquoHomophily and contagion are generi-cally confounded in observational social network studiesrdquo SociologicalMethods amp Research vol 40 no 2 pp 211ndash239 2011

[21] V Gupta D Varshney H Jhamtani D Kedia and S Karwa ldquoIdenti-fying purchase intent from social postsrdquo in ICWSM 2014

[22] Y Zhang and M Pennacchiotti ldquoPredicting purchase behaviors fromsocial mediardquo in Proceedings of the 22nd International Conference onWorld Wide Web pp 1521ndash1532 ACM 2013

[23] X W Zhao Y Guo Y He H Jiang Y Wu and X Li ldquoWe knowwhat you want to buy a demographic-based system for product recom-mendation on microblogsrdquo in Proceedings of the 20th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 1935ndash1944 ACM 2014

[24] R Crane and D Sornette ldquoRobust dynamic classes revealed by mea-suring the response function of a social systemrdquo Proceedings of theNational Academy of Sciences vol 105 no 41 pp 15649ndash15653 2008

[25] J Lehmann B Goncalves J J Ramasco and C Cattuto ldquoDynamicalclasses of collective attention in twitterrdquo in Proceedings of the 21stInternational Conference on World Wide Web pp 251ndash260 ACM 2012

[26] D M Romero B Meeder and J Kleinberg ldquoDifferences in the mechan-ics of information diffusion across topics idioms political hashtags andcomplex contagion on twitterrdquo in Proceedings of the 20th InternationalConference on World Wide Web pp 695ndash704 ACM 2011

[27] J Yang and J Leskovec ldquoPatterns of temporal variation in onlinemediardquo in Proceedings of the 4th ACM International Conference onWeb Search and Data Mining pp 177ndash186 ACM 2011

[28] M G Akritas S A Murphy and M P Lavalley ldquoThe theil-senestimator with doubly censored data and applications to astronomyrdquoJournal of the American Statistical Association vol 90 no 429pp 170ndash177 1995

[29] D W Hosmer Jr S Lemeshow and R X Sturdivant Applied logisticregression vol 398 John Wiley amp Sons 2013

[30] H-F Yu F-L Huang and C-J Lin ldquoDual coordinate descent methodsfor logistic regression and maximum entropy modelsrdquo Machine Learn-ing vol 85 no 1 pp 41ndash75 2011

[31] A Liaw M Wiener et al ldquoClassification and regression by random-forestrdquo R News vol 2 no 3 pp 18ndash22 2002

Page 2: Social Media Brand Engagement as a Proxy for E-commerce ...eprints.nottingham.ac.uk/55994/1/Social Media Brand... · of tools and services for marketeers to manage their Social Media

tra More precisely we specify distinct levels of EPA activitiesthat correspond to quantiles of the EPA data distribution overa given time period We use a data set provided by JD e-commerce platform (JDcom) that comprises search click-through and product orders for a sample of 33 vendors fromfive product categories For each vendor in the sample we col-lect social media posts reposts and commentaries publishedon the Sina Weibo (Weibocom) social media platform Thedata covers a period of two years We use time-alignment ofvendorsrsquo social media posts reposts and comments with EPAthat correspond to different decision stages in the consumersrsquopurchases search clickthrough and orders By relating thevendorsrsquo SMA and EPA we postulate

Should the data analysis reveal high correlation between avendorrsquos daily SMA and its customersrsquo daily EPA we mightbe able to create reliable predictors of EPA based on SMAfeatures

Our correlation analyses of specific SMA and EPA pairsie Pearson correlation of Post with Search Clickthroughwith Order and Repost with Comment data for different lagfactors (1-15 day shifts) point to a highly variable correlationover time as illustrated in Figure 1 Across the sample of33 vendors the correlation coefficients of SME-EPA pairsvary between 008 and 056 indicating that predicting dailyvolumes of EPA purely based on SMA would not be reliableNevertheless the question still remains

Could SMA features be used to predict changes in the levelsof EPA ie highs and lows in search clickthrough and ordersespecially if SMA has a sustained effect over several days eg3 to 7 days following a specific social media activity

In order to explore this we create predictive models forspecific levels of cumulative search clickthrough and ordersadded up over 1 3 and 7 days where levels are defined by thequantiles of the corresponding distributions over the trainingtime period We apply Random Forest and Logistic Regressionclassifiers for categories corresponding to 2 3 and 5 quantilesie (i) above the median (ii) within three quantiles (0-3333-66 66-100) and (iii) within five quantiles (0-20 20-40 40-60 60-80 and 80-100)

Our experiments show that SMA can be used to predict topquantiles of product orders and low quantiles of search andclickthrough across vendorsrsquo categories The performance forthe remaining ranges is on par with random predictions Theresults are important for optimizing the e-commerce platformoperations for peaks in EPA and for evaluating marketingcampaigns through SMA From the research perspective ourwork is the first to provide a detailed analysis of SMA-EPApairs and a method that reveals selective roles that SMA typesplay in predicting levels of specific EPA types

II RELATED WORK

Most of the existing work in understanding purchase behav-ior and user intent aims at estimating conversion rates for indi-vidual products individual customers or both [1] [3]ndash[7] [9]Computational studies of consumers behavior have adoptedad-hoc approaches and brute-force feature engineering aimingto uncover factors that explain user behavior particularly the

intent to buy [1] [2] [5] In contrast studies within businessand market research have led to conceptual frameworks thatare useful for both optimizing and interpreting experimentresults particularly the effects of features used in predictivemodels [4] [6]

Considering e-commerce platforms research has focusedon behavioral patterns product characteristics [3] [7] andfeedback in product reviews [10] and use them as featuresfor predicting the online product sale Such predictive fea-tures often originate from resources internal to the shoppingplatform [3] [11] or those collected externally such as socialmedia content [12] or query logs from online search engines[13]

With regards to social media past studies explored howcommercial intent expressed in tweets correlates with ex-ternal events [14] [15] Homophily was found to have asignificant influence on the purchase intent ie users are morelikely to purchase products similar to those purchased by theirfriends due to word-of-mouth effects [16] This phenomenonis important for e-commerce and has been studied in detailover the past decade [17] [18] It has also been recognizedthat other external factors such as geographic proximity mayexplain similarity in purchase patterns among friends on socialmedial [19] [20] Furthermore text analysis has been appliedto detect purchase intent from social media content [21] andFacebook profiles [22] and provide recommendations basedon micro-blogging activities [23]

We complement this work by considering social mediamonitoring practices and macro-level analyses that do notinvolve individual user profiles and content analysis but focuson the levels and timing of brand-specific social engagementsie volumes of vendorsrsquo post and the followers commentariesand repost Understanding temporal patterns of user activitieshas generated a wide interest over the past decade includingthe identification of broad classes of temporal patterns basedon activity peaks [24]ndash[27] Usually those classes are definedbased on specific volumes ranges and duration of activitiesbefore and after the peak Crane and Sornette [24] defineendogenous and exogenous origins of peaks based on triggersgenerated by internal aspects of the social network They foundthat popularity is mostly driven by exogenous factors insteadof endemic spreading Yang and Leskovec [27] propose a newmeasure of time series similarity and clustering

In our research we apply cross-sectional analysis of datafrom two distinct platforms rather than studying specific as-pects of commercial intent within social media or e-commerceplatforms themselves With the aim to explore the two sets oftemporal data we apply correlation analysis and formulatediscretized prediction problems that can leverage SMA fea-tures to predict certain levels of e-commerce activities despitea highly volatile correlation of the corresponding data series

III PRELIMINARIES AND DATA ANALYSIS

E-commerce platforms are instrumented to capture infor-mation about shopping activities and gather detailed statisticsabout consumersrsquo interactions with facilities that support thepurchase workflow Analyzing such data is instrumental for

optimizing the platform operations Our research is conductedin collaboration with JD1 with access to anonymized dataabout product sales of vendors in different product categoriesFurthermore we include information about vendorsrsquo socialengagements on Sina Weibo2 platform that hosts Web sitesof many JD vendors The vendors share post and engage thecommunity through comment and discussions We align SinaWeibo data and JD logs to understand how the vendorsrsquo SocialMedia Activities (SMA) on Weibo relate to the E-commercePlatform Activities (EPA) of JD users focusing on patternsof interaction within both services We perform a macro-level analysis considering time-series of actions analyses ofindividual usersrsquo behavior or content shared in social mediaare not part of this study

TABLE I Total social media activities of the sampledvendors within each product category for the two year period

(Jan 2016-Dec 2017)

Category Number of Vendors Post Comment RepostPhoneampEl 4 37312 5143200 2629674Sports 5 6927 295952 230733Food 9 17666 1072441 496664Clothes 6 9456 227258 155334Home 9 53857 1902164 681071

Fig 2 Total Post Comment and Repost statistics normalizedby the corresponding maxima over the two year period (Jan

2016-Dec 2017)A Data

Our EPA data set comprises JD records of (1) Search forbrands (2) Clickthrough and (3) Orders Search queries areincluded only if explicitly mention a vendorrsquos name The datawas collected for a sample of 50 most popular JD vendorsacross 5 product categories PhoneampElectronics Sports FoodClothes and Home on JDrsquos platform The vendorrsquos popularityranking is based on the JDrsquos sales performance metrics thatis treated as business confidential Among 50 companies weidentified 33 that have had a Weibo account for at least two

1JDcom is one of the largest e-commerce platforms in China2Weibocom is the largest social media platform in China with social media

features that combine those of Facebook (wwwfacebookcom) and Twitter(wwwtwittercom)

Fig 3 Total Search Clickthrough and Order volumes ofindividual vendors normalized by the daily maxima over the

two year period (Jan 2016-Dec 2017)

years from Jan 2016 until Dec 2017 and published at least 10posts over that period The final sample comprises 4 vendorsin PhoneampElectronics (PE) 5 in Sports 9 in Foods 6 inClothes and 9 in Home For each vendor we collected (1)Post (2) Repost and (3) Comment statistics

The EPA and SMA data is segmented per calendar dateTable I shows the SMA distribution per vendor category andFigure 2 plots the total SMA volumes of Post Repost andComment activities for individual vendors on a logarithmicscale As expected the Repost and Comment statistics aretwo orders of magnitude higher than Post illustrating astrong social media effect in response to the vendorsrsquo postingactivities Considering the SMA statistics across five categories(Figure 2) the vendors in the PhoneampElectronics categoryappear to gain most traction through repost and commentingactivities Sports category ranks lowest on posts while theClothes category is the lowest when considering the sum ofall three SMA statistics

Figure 3 presents the total Search Comment and Orderstatistics for individual vendors normalized by the maximumdaily volume over the period of two years It shows that onlinepurchases incur Search and Clickthrough volumes with 1-2orders of magnitude higher than Order volumes as they playimportant parts in the purchase decision process

B Correlation Analysis

We calculate the pairwise Pearson Correlation betweenSMA (Post Repost Comment) and EPA (Search Click-through Order) for the two year time-series corresponding tothe individual vendors Table II shows the highest correlationcoefficient for each SMA-EPA pair among all 33 vendorsIt confirms low-to-medium correlations of SMA-EPA acrossvendors with Post-Search having the highest maximum coef-ficient of 056 The maximum of 011 for Repost-Clickthroughand 008 for Repost-Orders indicate low correlations betweenthese SMA and EPA types across all the vendors

Heat maps in Figure 4 present the pairwise SMA-EPA cor-relation coefficients for each of the 33 vendors across the fulltwo year period by considering the next-day EPA statistics for

(a) Pearson correlations between SMA onday tp and Search on tp+1

(b) Pearson correlations between SMA onday tp and Clickthrough on tp+1

(c) Pearson correlations between SMA onday tp and Order on tp+1

Fig 4 Pearson Correlation between social media activities (Post Repost and Comment) and specific e-commerce platformactivities (Search Clickthrough and Order) for 33 vendors using their two year time series Heat maps show low to medium

correlations between SMA and EPA pairs

TABLE II The highest correlation coefficients acrossSMA-EPA pairs for the sample of 33 vendors

Post Comment RepostClickthrough 039 023 011Search 056 020 023Order 025 021 008

a given SMA day statistics Comparing the relative importanceof SMA types it appears that the Post statistics are morehighly correlated with Search and Clickthrough driving thebrand awareness and product interest Considering Commentstatistics across vendors they are more highly correlatedwith Clickthrough and Orders than with Search suggestingthat comments on posts and reposts may help with purchasedecisions Our analysis is the first to offer empirical evidencethat different types of social media engagements relate todifferent aspects of online shopping activities

We further consider SMA-EPA correlations for differenttime frames and time lags

1) Rolling Correlation Analysis We calculate the PearsonCorrelation Coefficient for SME-EPA pairs using the rolling30-day window over the period of two years (instead of thesingle two-year time-series) and show that the correlationcoefficients for daily statistics varies over time This is illus-trated in Figure 1 (Section I) showing time variations of thecorrelation coefficients for the vendor with the highest volumeof comments Without stable correlations it would be hard forexample to use Comment features to predict Orders

2) SMA-EPA Lag Analysis In our scenario SMA andEPA occur on different platforms ie Sina Weibo and JDrespectively Thus even if SMA has an effect on EPA onecan expect delays depending on the speed of social mediapropagation For that reason we investigate rolling SMA-EPAcorrelations with different day lags allowing for 1-15 daydelay By repeating the 30-day rolling window calculationswith 1-15 day shift in time series we observe a weekly patternin the rolling correlation with a spike every 7 days Howevereven for the vendor with the highest volume of Comment

activities the absolute correlation values are always below05 Generally the volumes of Post activities have the highestcorrelation with the volumes of Search or Clickthrough whileComment data has the highest correlation with Order

C Problem Formulation

Concluding from the presented analysis the task of pre-dicting daily volumes of e-commerce activities from levels ofsocial media activities is not well supported by the correla-tion analysis of daily statistics However in many practicalscenarios it is sufficient to predict a trend ie a change inthe volume range anticipating highs and lows that may affectproduct supplies and platform logistics Furthermore insteadof daily activities it is sufficient to predict purchase outcomesover a given period of time eg a cumulative sales for 3-7ahead Thus in Section IV we formulate the EPA predictionproblem by considering (a) discretized distributions of EPAvolumes for each vendor into 2 3 and 5 quantiles and (b)quantile predictions of EPA totals for the next day next 3days and next 7 days

IV PREDICTING E-COMMERCE ACTIVITIES

Let us now consider a set of vendors represented by theirbrands B = b1 b2 bi Without a loss of generalitywe can assume that each vendor is represented by a singlebrand that has a specific daily stream of social media signalsSt = s1 s2 sj consisting of official Post activitiesby the vendor and the Repost and Comment activities by theusers of the social media platform Similarly each brand hasa daily stream of e-commerce platform activities (EPA) Et =e1 e2 el corresponding to Search Clickthrough andOrder actions on a given day t Some of the social mediausers may also be customers of the e-commerce platform butwe do not attempt to align the user activities across the twoplatforms

We consider the predictive power of the social mediasignals regarding specific e-commerce platform activities iefv(bi t St Et) a discrete volume function of a brand bi

a timestamp t a social media stream St and a type of e-commerce activity stream Et We are in fact interested inaggregated volumes of specific EPA types and consider a timeframe T = [tp tp+h] where the date tp is the day of predictionand tp+h is the prediction horizon h = 1 3 7 (1-day 3-day7-day ahead) Therefore we aim to learn a proxy function fvthat at tp defines a cumulative function

FV (bi T SE) =

t=tp+h983131

t=tp

fv(bi t St Et)

integrating ie summing up fv over T

A Multi-class Predictions

Instead of predicting specific values of FV (bi S T Et) wepredict the distribution of FV in terms of quantile levels thatFV may attain on a given day or a given period of timeEvaluating the performance of quantile predictors enables usto assess whether different social media signals are predictiveof specific EPA levels We cast that as a multi-class categoriza-tion problem using supervised learning where the number ofquantiles q determines the number of classes Given a numberof quantiles q and a training set with values FV (bi S T Et)from the training time period we determine the data pointsFV (bi S T Et) within each quantile ie

Pr(FV (bi S T Et) le fk) = kq forallk isin [1 q minus 1]

and the corresponding ranges of FV values The mini-mummaximum values determine the thresholds for assigninga class label to an instance of FV

Training of the multi-class predictor is based on partitioningthe training set into distribution quantiles with correspondingclass labels and value ranges We use the quantile rangesto assign class labels to FV values of the test set If thetest distribution is broader than the one of the training setFV values below the minimum or above the maximum areassigned to the lowesthighest quantile respectively

Given a time of prediction tp we extract features fromthe social media stream S up to tp and predict FV for theprediction horizon tp+h We consider activities on a daily basisand use tp as 235959 on each day

B Feature Construction

Our feature set is based on the three types of SMA signalsPost Repost and Comment activities For each signal type wegenerate features by calculating statistics from SMA streamsfor the time spans of a pre-specified length In particular weconsider K-day periods with Kisin1 3 5 7 Table III showsthe statistics that we calculate for each signal type ie PostRepost Comment over the specific stream length K TheTheil-Sen estimator [28] is a non-parametric trend detectorthat corresponds to the median over all possible combinationsof slopes over K daily measures of the SMA activity In totalwe construct 22 features for each of the 3 SMA types thetotal of 66 features that characterize a vendorrsquos SMA

TABLE III Description of statistics used as features for eachSMA type over K days before tp For K = 1 we consider

only Featsum

Feature DescriptionFeatsum Sum of activity volume over K-day periodFeatmean Mean activity volume over K-day periodFeatmax Maximum activity volume over K-day periodFeatmin Maximum activity volume over K-day periodFeatvar Variance of activity volume over K-day periodFeatstdev Standard deviation of activity volume over K-day pe-

riodFeatprev Activity volume on previous dayFeattheil Theil-Sen estimator

V EXPERIMENTS

We use Logistic Regression and Random Forest [29]ndash[31] tolearn multi-class classification models for 2 3 and 5 quantiles(2-q 3-q 5-q) and vary the prediction horizon by modelingnext day (1-day) 3-day and 7-day cumulative volumes ofSearch Clickthrough and Orders We evaluate multi-class pre-diction results by calculating standard precision recall and F1statistics However we focus our discussion on the precisionsince the aim is to assess the effectiveness of SMA featuresin predicting volume levels of specific EPA types Thusidentifying true-positive instances within a specific quantileis given a priority over avoiding false-negative instances Infact correct predictions of high (top 20-30) and low(bottom 20-30) quantiles are of particular interest sincetheir detection can improve SMA campaigns and optimize e-commerce operations for high levels of activities that on somedays such as global shopping events can increase 100-fold

Fig 5 Sliding window of a 12-month training and 1-monthtest data with a shift of one calendar month to cover the test

period from Jan 2017 until Dec 2017

1) Temporal Cross-validation All our experiments areperformed using 12-fold time-series cross-validation as shownin Figure 5 with data sets specific to each vendor We trainand test our models using a two year JD data set (SectionIII-A) Our starting training set covers the 12 month periodfrom 1 January 2016 to 31 December 2016 The test set isthe following month In each fold we slide the 12-monthtraining and 1-month test period by one calendar month Thusfor each experiment we use one year of historical data andpredict EPA in every calendar month in 2017 For each vendorwe report the average precision statistics across 12-folds Wecollate and average precision statistics across 33 vendors andacross vendor categories

2) Activity Predictions For a given quantile scale (2-q 3-q5-q) we predict volumes of activities for the next day tp+1 (1-

(a) Precision statistics for 3-q next-day predictions for Orders (b) Precision statistics for 5-q next-day predictions for Orders

Fig 6 Precision of Random Forest predictors for 3-q and 5-q categories for the next-day Orders across 33 vendors The topquantile precisions are higher than random predictions (33 for 3-q and 20 for 5-q)

TABLE IV Precision statistics for Random Forest 5-q classifiers for the next-day volumes of each EPA type OrderClickthrough and Search Results are aggregated per five categories

Vendor 0-20 20-40 40-60 60-80 80-100AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN

Order

PampE 0159 0338 0013 0136 0218 0054 0198 0292 0070 0298 0570 0149 0341 0393 0271Sports 0129 0340 0000 0161 0370 0045 0154 0263 0021 0247 0425 0149 0456 0653 0154Food 0070 0231 0000 0090 0302 0000 0144 0271 0030 0313 0462 0181 0410 0708 0213Clothes 0199 0571 0000 0129 0217 0011 0186 0385 0065 0225 0464 0018 0308 0528 0076Home 0167 0596 0000 0157 0317 0000 0263 0517 0092 0222 0333 0091 0312 0481 0079

Clickthrough

PampE 0371 0548 0192 0200 0311 0099 0147 0241 0025 0159 0353 0000 0265 0565 0055Sports 0462 0631 0275 0264 0393 0094 0144 0328 0038 0135 0348 0024 0148 0286 0037Food 0351 0860 0083 0223 0488 0056 0177 0345 0030 0136 0290 0000 0162 0385 0036Clothes 0312 0571 0013 0165 0260 0030 0199 0426 0012 0145 0308 0000 0192 0500 0023Home 0324 0645 0100 0225 0347 0116 0188 0385 0025 0149 0259 0039 0174 0365 0000

Search

PampE 0521 0679 0336 0231 0353 0065 0109 0293 0029 0077 0176 0000 0255 0481 0070Sports 0483 0763 0276 0248 0350 0103 0130 0222 0000 0081 0173 0000 0126 0250 0066Food 0308 0784 0089 0193 0370 0027 0183 0324 0027 0175 0373 0052 0185 0317 0048Clothes 0329 0628 0194 0251 0353 0114 0232 0313 0131 0127 0238 0020 0119 0288 0043Home 0351 0573 0179 0232 0323 0041 0200 0322 0083 0200 0311 0098 0194 0316 0086

day) and the multi-day cumulative activities over 3-day and7-day periods For each prediction type and each individualvendor we use the corresponding quantile scale determinedover the training data Our experiments thus involve 891predictions for each of 33 vendors 3 EPA types (SearchClickthrough Orders) with three quantile scales (2-q 3-qand 5-q) and 3 time periods (1-day 3-day 7-day) For thisdiscussion we select experiments that shed light on

bull How successful the predictors are in identifying quantilesfor individual EPA types across the vendor sample

bull How well the predictors perform for the cumulative 3-dayand 7-day activities across the vendor sample

bull Which features significantly contribute to the perfor-mance of predictors for the individual EPA type

A Next-day EPA Predictions for 3 and 5 QuantilesExperiments with the next day predictions of EPA for 3-q

and 5-q show that the Random Forest (RF) predictors performhigher than random for Orders in the top quantiles (top 33and 20 respectively) Figure 6 presents RF precision forOrder quantiles for individual vendors Table IV summarizesRF precision statistics for all three EPA types and the fivevendor categories We highlight the RF results that are betterthan random predictions (gt 020 for 5-q) We see thatin addition to the top quantile predictions for Orders theprecision statistics are better than random for Search andClickthrough in the lowest quantile (bottom 33 for 3-q and

bottom 20 for 5-q) All other quantile predictions for OrderClickthrough and Search are on par or lower than random

We conducted the same experiments with Logistic Regres-sion (LR) and found that RF outperforms LR predictors Infact for our sample of product vendors the t-test performedfor LR and RF predictions yield p lt 005 for all the quantilelabels (not just the top quantiles)

B Predictions of Multi-day Cumulative EPA

Considering a selective success of SMA based predictorsfor quantiles of the next-day EPA we train predictors forcumulative EPA ie EPA volumes over 3 and 7 days Weexpect that multi-day cumulative statistics are less volatileand therefore the quantile levels may be more stable andpredictable over time

Our experiments show that RF predictors for 3-day and7-day cumulative EPA are consistently performing at thesimilar order of magnitude for a given quantile level That isillustrated in Table V showing the precision statistics of RFpredictors for 1-day 3-day and 7-day cumulative volumes ofOrder Clickthrough and Search for 3 quantiles (3-q) Similarobservations are made for 5 quantiles and for the LR classifier

We conclude that our high performing predictors for cu-mulative multi-day EPA volumes can be flexibly applied todifferent scenarios that benefit from observing cumulativeEPA across multiple days This includes the precision of topquantile volumes for Orders Figure 7 shows 3-q predictions

TABLE V Three quantile prediction results for five categories of products across 33 vendors based on Random Forestreporting average precision by categories of 1-day (1D) 3-day (3D) and 7-day (7D) cumulative Orders Clickthrough and

Search for each quantile

Vendor 0-33 33-66 66-1001D 3D 7D 1D 3D 7D 1D 3D 7D

Order

PE 0232 0225 0237 0327 0329 0303 0522 0536 0551Sports 0241 0212 0172 0237 0240 0298 0584 0598 0607Food 0141 0139 0133 0288 0230 0251 0663 0656 0675Clothes 0287 0286 0280 0297 0272 0310 0397 0431 0478Home 0261 0266 0244 0375 0383 0351 0453 0467 0464AVG 0226 0222 0209 0310 0293 0302 0528 0540 0556

Clickthrough

PE 0488 0480 0471 0269 0303 0232 0385 0362 0372Sports 0637 0620 0623 0226 0263 0214 0232 0223 0266Food 0464 0471 0476 0282 0285 0286 0263 0266 0285Clothes 0418 0399 0423 0301 0311 0293 0331 0328 0329Home 0475 0487 0484 0327 0307 0289 0285 0278 0256AVG 0488 0486 0490 0288 0295 0271 0291 0286 0293

Search

PE 0597 0621 0655 0182 0194 0150 0277 0259 0289Sports 0621 0613 0616 0266 0248 0263 0166 0157 0152Food 0442 0413 0440 0316 0319 0292 0299 0291 0276Clothes 0455 0479 0476 0375 0370 0355 0225 0205 0172Home 0453 0474 0524 0360 0313 0298 0330 0327 0326AVG 0493 0497 0522 0315 0301 0283 0271 0261 0254

(a) Precision of bottom quantile predictionsfor Orders

(b) Precision of middle quantile predictionsfor Orders

(c) Precision of top quantile prediction forOrders

Fig 7 Precision of Random Forest predictions into three quantiles for 1-day 3-day and 7-day cumulative Orders

of 1-day 3-day and 7-day cumulative Orders for individualvendors

C Feature Significance

Both Random Forest and Logistic Regression allow us toassess the significance of individual feature types in terms oftheir contributions to the prediction decision For illustrationwe present the analysis of features for the RF classifier trainedfor 3-q predictions of the next-day EPA volumes For eachclassifier we calculate the average relative rank of a featureTable VI shows the top 10 contributing features of 1-day EPApredictors for Orders Clickthrough and Search respectivelyacross all the vendors and 3 quantiles

We observe that the relative feature importance is to adegree in agreement with the observations from the corre-lation analysis in Section V-2 Search levels are predictedby Comment and Repost activities considered over differentlengths of time 1 3 5 and 7 days Clickthrough activitiesseem to be aligned with SMA over 7 and 3 day periodsprimarily with consumer comments Orders however areclearly related to Comment volumes over a longer periodsie 5 and 7 days Overall we recognize the importance of

features generated from Comment activities as they contributeto the predictors of all EPA types

TABLE VI Ten top-ranked features of the 3-q RandomForest predictors for the next-day EPA

Rank Order Clickthrough Search1 7DCommentmin 7DCommentmin 7DCommentmin

2 7DCommentsum 3DCommentmean 5DRepostmin

3 7DCommentmean 3DCommentsum PreviousDComment4 7DCommentmax PreviousDComment 7DCommentstd5 5DCommentmean 3DCommentmax 3DCommentmean

6 5DCommentmax 7DCommentsum 7DRepoststd7 5DCommentsum 7DCommentmean 5DRepostsum8 7DCommentstd 7DPostsum 3DCommentsum9 7DCommentvar 3DCommentmin 7DCommentvar10 3DCommentmean 5DCommentvar 5DCommentvar

VI CONCLUDING REMARKS

This paper presents a detailed empirical study of the re-lationship between SMA of product vendors and EPA ofconsumers interested in the vendorsrsquo products The studyis the first to characterize the correlation of specific SMAengagements ie posts reposts and comments and EPA typesthat correspond to specific stages in the consumersrsquo purchasedecisions ie search for brands and products clickthrough

product information and product orders Our analyses un-covered low-to-moderate correlations between the volumes ofSMA-EPA pairs suggesting that predicting daily volumes ofEPA based on SMA volumes alone is not well supported How-ever moderate correlations of Post-Search Post-Clickthroughand Comment-Order across vendors suggest that one may beable to train predictors of EPA distributions and their changesrather that the precise daily volumes We thus introduce a newapproach of characterizing SMA-EPA relationship in terms ofEPA quantiles We formulate EPA predictions as a multi-classcategorization problem into 2 3 and 5 quantiles Our RandomForest classifiers outperform both the random predictors andthe Logistic Regression classifiers for top quantiles of Orderand bottom quantiles of Search and Clickthrough Our studyprovides unique insights into the varied correlations of SMA-EPA pairs and a mixed success of SMA-based predictorsin determining EPA levels A general view that social me-dia engagements undoubtedly drive consumer traction on e-commerce platforms is not substantiated by the Sina Weiboand JD case study and requires further analysis In our futurework we will expand explorations of this issue with a broaderset of product categories and SMA analyses that considerproperties of the content exchanged through the SMA

VII ACKNOWLEDGEMENTS

We wish to express our gratitude to JDcom for providingaccess to invaluable data and Bin Xu and Hao Dong for theirassistance We also acknowledge the financial support from theInternational Doctoral Innovation Centre Ningbo EducationBureau Ningbo Science and Technology Bureau and theUniversity of Nottingham This work was also supported bythe UK Engineering and Physical Sciences Research Council[grant number EPL0154631]

REFERENCES

[1] J Yeo S Kim E Koh S-w Hwang and N Lipka ldquoPredicting onlinepurchase conversion for retargetingrdquo in Proceedings of the 10th ACMInternational Conference on Web Search and Data Mining pp 591ndash600ACM 2017

[2] C Lo D Frankowski and J Leskovec ldquoUnderstanding behaviors thatlead to purchasing A case study of pinterestrdquo in Proceedings of the22th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining pp 531ndash540 2016

[3] W W Moe ldquoBuying searching or browsing Differentiating betweenonline shoppers using in-store navigational clickstreamrdquo Journal ofConsumer Psychology vol 13 no 1-2 pp 29ndash39 2003

[4] C Sismeiro and R E Bucklin ldquoModeling purchase behavior at an e-commerce web site A task-completion approachrdquo Journal of MarketingResearch vol 41 no 3 pp 306ndash323 2004

[5] C Park D Kim J Oh and H Yu ldquoPredicting user purchase in e-commerce by comprehensive feature engineering and decision boundaryfocused under-samplingrdquo in Proceedings of the 2015 International ACMRecommender Systems Challenge p 8 ACM 2015

[6] D Van den Poel and W Buckinx ldquoPredicting online-purchasing be-haviourrdquo European Journal of Operational Research vol 166 no 2pp 557ndash575 2005

[7] D J Bertsimas A J Mersereau and N R Patel ldquoDynamic classifica-tion of online customersrdquo in Proceedings of the 2003 SIAM InternationalConference on Data Mining pp 107ndash118 SIAM 2003

[8] K L Kotler Philip Keller Marketing Management volume 14 PrenticeHall Englewood Cliffs 2015

[9] L Xu J A Duan and A Whinston ldquoPath to purchase A mutuallyexciting point process model for online advertising and conversionrdquoManagement Science vol 60 no 6 pp 1392ndash1412 2014

[10] X Yu Y Liu X Huang and A An ldquoMining online reviews forpredicting sales performance A case study in the movie domainrdquoIEEE Transactions on Knowledge and Data Engineering vol 24 no 4pp 720ndash734 2012

[11] E Young Kim and Y-K Kim ldquoPredicting online purchase intentionsfor clothing productsrdquo European Journal of Marketing vol 38 no 7pp 883ndash897 2004

[12] D Hawkins D Mothersbaugh and R Best ldquoConsumer behaviourBuilding marketing strategy mcgraw hillrdquo 2012

[13] G Kulkarni P Kannan and W Moe ldquoUsing online search data toforecast new product salesrdquo Decision Support Systems vol 52 no 3pp 604ndash611 2012

[14] J Wang W X Zhao H Wei H Yan and X Li ldquoMining new businessopportunities Identifying trend related products by leveraging commer-cial intents from microblogsrdquo in Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Processing pp 1337ndash13472013

[15] B Hollerit M Kroll and M Strohmaier ldquoTowards linking buyers andsellers detecting commercial intent on twitterrdquo in Proceedings of the22nd International Conference on World Wide Web pp 629ndash632 ACM2013

[16] F Kooti K Lerman L M Aiello M Grbovic N Djuric and V Ra-dosavljevic ldquoPortrait of an online shopper Understanding and predictingconsumer behaviorrdquo in Proceedings of the 9th ACM InternationalConference on Web Search and Data Mining pp 205ndash214 ACM 2016

[17] J Leskovec L A Adamic and B A Huberman ldquoThe dynamics ofviral marketingrdquo ACM Transactions on The Web (TWEB) vol 1 no 1p 5 2007

[18] A Anagnostopoulos R Kumar and M Mahdian ldquoInfluence and cor-relation in social networksrdquo in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 7ndash15 ACM 2008

[19] D Crandall D Cosley D Huttenlocher J Kleinberg and S SurildquoFeedback effects between similarity and social influence in onlinecommunitiesrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining pp 160ndash168ACM 2008

[20] C R Shalizi and A C Thomas ldquoHomophily and contagion are generi-cally confounded in observational social network studiesrdquo SociologicalMethods amp Research vol 40 no 2 pp 211ndash239 2011

[21] V Gupta D Varshney H Jhamtani D Kedia and S Karwa ldquoIdenti-fying purchase intent from social postsrdquo in ICWSM 2014

[22] Y Zhang and M Pennacchiotti ldquoPredicting purchase behaviors fromsocial mediardquo in Proceedings of the 22nd International Conference onWorld Wide Web pp 1521ndash1532 ACM 2013

[23] X W Zhao Y Guo Y He H Jiang Y Wu and X Li ldquoWe knowwhat you want to buy a demographic-based system for product recom-mendation on microblogsrdquo in Proceedings of the 20th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 1935ndash1944 ACM 2014

[24] R Crane and D Sornette ldquoRobust dynamic classes revealed by mea-suring the response function of a social systemrdquo Proceedings of theNational Academy of Sciences vol 105 no 41 pp 15649ndash15653 2008

[25] J Lehmann B Goncalves J J Ramasco and C Cattuto ldquoDynamicalclasses of collective attention in twitterrdquo in Proceedings of the 21stInternational Conference on World Wide Web pp 251ndash260 ACM 2012

[26] D M Romero B Meeder and J Kleinberg ldquoDifferences in the mechan-ics of information diffusion across topics idioms political hashtags andcomplex contagion on twitterrdquo in Proceedings of the 20th InternationalConference on World Wide Web pp 695ndash704 ACM 2011

[27] J Yang and J Leskovec ldquoPatterns of temporal variation in onlinemediardquo in Proceedings of the 4th ACM International Conference onWeb Search and Data Mining pp 177ndash186 ACM 2011

[28] M G Akritas S A Murphy and M P Lavalley ldquoThe theil-senestimator with doubly censored data and applications to astronomyrdquoJournal of the American Statistical Association vol 90 no 429pp 170ndash177 1995

[29] D W Hosmer Jr S Lemeshow and R X Sturdivant Applied logisticregression vol 398 John Wiley amp Sons 2013

[30] H-F Yu F-L Huang and C-J Lin ldquoDual coordinate descent methodsfor logistic regression and maximum entropy modelsrdquo Machine Learn-ing vol 85 no 1 pp 41ndash75 2011

[31] A Liaw M Wiener et al ldquoClassification and regression by random-forestrdquo R News vol 2 no 3 pp 18ndash22 2002

Page 3: Social Media Brand Engagement as a Proxy for E-commerce ...eprints.nottingham.ac.uk/55994/1/Social Media Brand... · of tools and services for marketeers to manage their Social Media

optimizing the platform operations Our research is conductedin collaboration with JD1 with access to anonymized dataabout product sales of vendors in different product categoriesFurthermore we include information about vendorsrsquo socialengagements on Sina Weibo2 platform that hosts Web sitesof many JD vendors The vendors share post and engage thecommunity through comment and discussions We align SinaWeibo data and JD logs to understand how the vendorsrsquo SocialMedia Activities (SMA) on Weibo relate to the E-commercePlatform Activities (EPA) of JD users focusing on patternsof interaction within both services We perform a macro-level analysis considering time-series of actions analyses ofindividual usersrsquo behavior or content shared in social mediaare not part of this study

TABLE I Total social media activities of the sampledvendors within each product category for the two year period

(Jan 2016-Dec 2017)

Category Number of Vendors Post Comment RepostPhoneampEl 4 37312 5143200 2629674Sports 5 6927 295952 230733Food 9 17666 1072441 496664Clothes 6 9456 227258 155334Home 9 53857 1902164 681071

Fig 2 Total Post Comment and Repost statistics normalizedby the corresponding maxima over the two year period (Jan

2016-Dec 2017)A Data

Our EPA data set comprises JD records of (1) Search forbrands (2) Clickthrough and (3) Orders Search queries areincluded only if explicitly mention a vendorrsquos name The datawas collected for a sample of 50 most popular JD vendorsacross 5 product categories PhoneampElectronics Sports FoodClothes and Home on JDrsquos platform The vendorrsquos popularityranking is based on the JDrsquos sales performance metrics thatis treated as business confidential Among 50 companies weidentified 33 that have had a Weibo account for at least two

1JDcom is one of the largest e-commerce platforms in China2Weibocom is the largest social media platform in China with social media

features that combine those of Facebook (wwwfacebookcom) and Twitter(wwwtwittercom)

Fig 3 Total Search Clickthrough and Order volumes ofindividual vendors normalized by the daily maxima over the

two year period (Jan 2016-Dec 2017)

years from Jan 2016 until Dec 2017 and published at least 10posts over that period The final sample comprises 4 vendorsin PhoneampElectronics (PE) 5 in Sports 9 in Foods 6 inClothes and 9 in Home For each vendor we collected (1)Post (2) Repost and (3) Comment statistics

The EPA and SMA data is segmented per calendar dateTable I shows the SMA distribution per vendor category andFigure 2 plots the total SMA volumes of Post Repost andComment activities for individual vendors on a logarithmicscale As expected the Repost and Comment statistics aretwo orders of magnitude higher than Post illustrating astrong social media effect in response to the vendorsrsquo postingactivities Considering the SMA statistics across five categories(Figure 2) the vendors in the PhoneampElectronics categoryappear to gain most traction through repost and commentingactivities Sports category ranks lowest on posts while theClothes category is the lowest when considering the sum ofall three SMA statistics

Figure 3 presents the total Search Comment and Orderstatistics for individual vendors normalized by the maximumdaily volume over the period of two years It shows that onlinepurchases incur Search and Clickthrough volumes with 1-2orders of magnitude higher than Order volumes as they playimportant parts in the purchase decision process

B Correlation Analysis

We calculate the pairwise Pearson Correlation betweenSMA (Post Repost Comment) and EPA (Search Click-through Order) for the two year time-series corresponding tothe individual vendors Table II shows the highest correlationcoefficient for each SMA-EPA pair among all 33 vendorsIt confirms low-to-medium correlations of SMA-EPA acrossvendors with Post-Search having the highest maximum coef-ficient of 056 The maximum of 011 for Repost-Clickthroughand 008 for Repost-Orders indicate low correlations betweenthese SMA and EPA types across all the vendors

Heat maps in Figure 4 present the pairwise SMA-EPA cor-relation coefficients for each of the 33 vendors across the fulltwo year period by considering the next-day EPA statistics for

(a) Pearson correlations between SMA onday tp and Search on tp+1

(b) Pearson correlations between SMA onday tp and Clickthrough on tp+1

(c) Pearson correlations between SMA onday tp and Order on tp+1

Fig 4 Pearson Correlation between social media activities (Post Repost and Comment) and specific e-commerce platformactivities (Search Clickthrough and Order) for 33 vendors using their two year time series Heat maps show low to medium

correlations between SMA and EPA pairs

TABLE II The highest correlation coefficients acrossSMA-EPA pairs for the sample of 33 vendors

Post Comment RepostClickthrough 039 023 011Search 056 020 023Order 025 021 008

a given SMA day statistics Comparing the relative importanceof SMA types it appears that the Post statistics are morehighly correlated with Search and Clickthrough driving thebrand awareness and product interest Considering Commentstatistics across vendors they are more highly correlatedwith Clickthrough and Orders than with Search suggestingthat comments on posts and reposts may help with purchasedecisions Our analysis is the first to offer empirical evidencethat different types of social media engagements relate todifferent aspects of online shopping activities

We further consider SMA-EPA correlations for differenttime frames and time lags

1) Rolling Correlation Analysis We calculate the PearsonCorrelation Coefficient for SME-EPA pairs using the rolling30-day window over the period of two years (instead of thesingle two-year time-series) and show that the correlationcoefficients for daily statistics varies over time This is illus-trated in Figure 1 (Section I) showing time variations of thecorrelation coefficients for the vendor with the highest volumeof comments Without stable correlations it would be hard forexample to use Comment features to predict Orders

2) SMA-EPA Lag Analysis In our scenario SMA andEPA occur on different platforms ie Sina Weibo and JDrespectively Thus even if SMA has an effect on EPA onecan expect delays depending on the speed of social mediapropagation For that reason we investigate rolling SMA-EPAcorrelations with different day lags allowing for 1-15 daydelay By repeating the 30-day rolling window calculationswith 1-15 day shift in time series we observe a weekly patternin the rolling correlation with a spike every 7 days Howevereven for the vendor with the highest volume of Comment

activities the absolute correlation values are always below05 Generally the volumes of Post activities have the highestcorrelation with the volumes of Search or Clickthrough whileComment data has the highest correlation with Order

C Problem Formulation

Concluding from the presented analysis the task of pre-dicting daily volumes of e-commerce activities from levels ofsocial media activities is not well supported by the correla-tion analysis of daily statistics However in many practicalscenarios it is sufficient to predict a trend ie a change inthe volume range anticipating highs and lows that may affectproduct supplies and platform logistics Furthermore insteadof daily activities it is sufficient to predict purchase outcomesover a given period of time eg a cumulative sales for 3-7ahead Thus in Section IV we formulate the EPA predictionproblem by considering (a) discretized distributions of EPAvolumes for each vendor into 2 3 and 5 quantiles and (b)quantile predictions of EPA totals for the next day next 3days and next 7 days

IV PREDICTING E-COMMERCE ACTIVITIES

Let us now consider a set of vendors represented by theirbrands B = b1 b2 bi Without a loss of generalitywe can assume that each vendor is represented by a singlebrand that has a specific daily stream of social media signalsSt = s1 s2 sj consisting of official Post activitiesby the vendor and the Repost and Comment activities by theusers of the social media platform Similarly each brand hasa daily stream of e-commerce platform activities (EPA) Et =e1 e2 el corresponding to Search Clickthrough andOrder actions on a given day t Some of the social mediausers may also be customers of the e-commerce platform butwe do not attempt to align the user activities across the twoplatforms

We consider the predictive power of the social mediasignals regarding specific e-commerce platform activities iefv(bi t St Et) a discrete volume function of a brand bi

a timestamp t a social media stream St and a type of e-commerce activity stream Et We are in fact interested inaggregated volumes of specific EPA types and consider a timeframe T = [tp tp+h] where the date tp is the day of predictionand tp+h is the prediction horizon h = 1 3 7 (1-day 3-day7-day ahead) Therefore we aim to learn a proxy function fvthat at tp defines a cumulative function

FV (bi T SE) =

t=tp+h983131

t=tp

fv(bi t St Et)

integrating ie summing up fv over T

A Multi-class Predictions

Instead of predicting specific values of FV (bi S T Et) wepredict the distribution of FV in terms of quantile levels thatFV may attain on a given day or a given period of timeEvaluating the performance of quantile predictors enables usto assess whether different social media signals are predictiveof specific EPA levels We cast that as a multi-class categoriza-tion problem using supervised learning where the number ofquantiles q determines the number of classes Given a numberof quantiles q and a training set with values FV (bi S T Et)from the training time period we determine the data pointsFV (bi S T Et) within each quantile ie

Pr(FV (bi S T Et) le fk) = kq forallk isin [1 q minus 1]

and the corresponding ranges of FV values The mini-mummaximum values determine the thresholds for assigninga class label to an instance of FV

Training of the multi-class predictor is based on partitioningthe training set into distribution quantiles with correspondingclass labels and value ranges We use the quantile rangesto assign class labels to FV values of the test set If thetest distribution is broader than the one of the training setFV values below the minimum or above the maximum areassigned to the lowesthighest quantile respectively

Given a time of prediction tp we extract features fromthe social media stream S up to tp and predict FV for theprediction horizon tp+h We consider activities on a daily basisand use tp as 235959 on each day

B Feature Construction

Our feature set is based on the three types of SMA signalsPost Repost and Comment activities For each signal type wegenerate features by calculating statistics from SMA streamsfor the time spans of a pre-specified length In particular weconsider K-day periods with Kisin1 3 5 7 Table III showsthe statistics that we calculate for each signal type ie PostRepost Comment over the specific stream length K TheTheil-Sen estimator [28] is a non-parametric trend detectorthat corresponds to the median over all possible combinationsof slopes over K daily measures of the SMA activity In totalwe construct 22 features for each of the 3 SMA types thetotal of 66 features that characterize a vendorrsquos SMA

TABLE III Description of statistics used as features for eachSMA type over K days before tp For K = 1 we consider

only Featsum

Feature DescriptionFeatsum Sum of activity volume over K-day periodFeatmean Mean activity volume over K-day periodFeatmax Maximum activity volume over K-day periodFeatmin Maximum activity volume over K-day periodFeatvar Variance of activity volume over K-day periodFeatstdev Standard deviation of activity volume over K-day pe-

riodFeatprev Activity volume on previous dayFeattheil Theil-Sen estimator

V EXPERIMENTS

We use Logistic Regression and Random Forest [29]ndash[31] tolearn multi-class classification models for 2 3 and 5 quantiles(2-q 3-q 5-q) and vary the prediction horizon by modelingnext day (1-day) 3-day and 7-day cumulative volumes ofSearch Clickthrough and Orders We evaluate multi-class pre-diction results by calculating standard precision recall and F1statistics However we focus our discussion on the precisionsince the aim is to assess the effectiveness of SMA featuresin predicting volume levels of specific EPA types Thusidentifying true-positive instances within a specific quantileis given a priority over avoiding false-negative instances Infact correct predictions of high (top 20-30) and low(bottom 20-30) quantiles are of particular interest sincetheir detection can improve SMA campaigns and optimize e-commerce operations for high levels of activities that on somedays such as global shopping events can increase 100-fold

Fig 5 Sliding window of a 12-month training and 1-monthtest data with a shift of one calendar month to cover the test

period from Jan 2017 until Dec 2017

1) Temporal Cross-validation All our experiments areperformed using 12-fold time-series cross-validation as shownin Figure 5 with data sets specific to each vendor We trainand test our models using a two year JD data set (SectionIII-A) Our starting training set covers the 12 month periodfrom 1 January 2016 to 31 December 2016 The test set isthe following month In each fold we slide the 12-monthtraining and 1-month test period by one calendar month Thusfor each experiment we use one year of historical data andpredict EPA in every calendar month in 2017 For each vendorwe report the average precision statistics across 12-folds Wecollate and average precision statistics across 33 vendors andacross vendor categories

2) Activity Predictions For a given quantile scale (2-q 3-q5-q) we predict volumes of activities for the next day tp+1 (1-

(a) Precision statistics for 3-q next-day predictions for Orders (b) Precision statistics for 5-q next-day predictions for Orders

Fig 6 Precision of Random Forest predictors for 3-q and 5-q categories for the next-day Orders across 33 vendors The topquantile precisions are higher than random predictions (33 for 3-q and 20 for 5-q)

TABLE IV Precision statistics for Random Forest 5-q classifiers for the next-day volumes of each EPA type OrderClickthrough and Search Results are aggregated per five categories

Vendor 0-20 20-40 40-60 60-80 80-100AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN

Order

PampE 0159 0338 0013 0136 0218 0054 0198 0292 0070 0298 0570 0149 0341 0393 0271Sports 0129 0340 0000 0161 0370 0045 0154 0263 0021 0247 0425 0149 0456 0653 0154Food 0070 0231 0000 0090 0302 0000 0144 0271 0030 0313 0462 0181 0410 0708 0213Clothes 0199 0571 0000 0129 0217 0011 0186 0385 0065 0225 0464 0018 0308 0528 0076Home 0167 0596 0000 0157 0317 0000 0263 0517 0092 0222 0333 0091 0312 0481 0079

Clickthrough

PampE 0371 0548 0192 0200 0311 0099 0147 0241 0025 0159 0353 0000 0265 0565 0055Sports 0462 0631 0275 0264 0393 0094 0144 0328 0038 0135 0348 0024 0148 0286 0037Food 0351 0860 0083 0223 0488 0056 0177 0345 0030 0136 0290 0000 0162 0385 0036Clothes 0312 0571 0013 0165 0260 0030 0199 0426 0012 0145 0308 0000 0192 0500 0023Home 0324 0645 0100 0225 0347 0116 0188 0385 0025 0149 0259 0039 0174 0365 0000

Search

PampE 0521 0679 0336 0231 0353 0065 0109 0293 0029 0077 0176 0000 0255 0481 0070Sports 0483 0763 0276 0248 0350 0103 0130 0222 0000 0081 0173 0000 0126 0250 0066Food 0308 0784 0089 0193 0370 0027 0183 0324 0027 0175 0373 0052 0185 0317 0048Clothes 0329 0628 0194 0251 0353 0114 0232 0313 0131 0127 0238 0020 0119 0288 0043Home 0351 0573 0179 0232 0323 0041 0200 0322 0083 0200 0311 0098 0194 0316 0086

day) and the multi-day cumulative activities over 3-day and7-day periods For each prediction type and each individualvendor we use the corresponding quantile scale determinedover the training data Our experiments thus involve 891predictions for each of 33 vendors 3 EPA types (SearchClickthrough Orders) with three quantile scales (2-q 3-qand 5-q) and 3 time periods (1-day 3-day 7-day) For thisdiscussion we select experiments that shed light on

bull How successful the predictors are in identifying quantilesfor individual EPA types across the vendor sample

bull How well the predictors perform for the cumulative 3-dayand 7-day activities across the vendor sample

bull Which features significantly contribute to the perfor-mance of predictors for the individual EPA type

A Next-day EPA Predictions for 3 and 5 QuantilesExperiments with the next day predictions of EPA for 3-q

and 5-q show that the Random Forest (RF) predictors performhigher than random for Orders in the top quantiles (top 33and 20 respectively) Figure 6 presents RF precision forOrder quantiles for individual vendors Table IV summarizesRF precision statistics for all three EPA types and the fivevendor categories We highlight the RF results that are betterthan random predictions (gt 020 for 5-q) We see thatin addition to the top quantile predictions for Orders theprecision statistics are better than random for Search andClickthrough in the lowest quantile (bottom 33 for 3-q and

bottom 20 for 5-q) All other quantile predictions for OrderClickthrough and Search are on par or lower than random

We conducted the same experiments with Logistic Regres-sion (LR) and found that RF outperforms LR predictors Infact for our sample of product vendors the t-test performedfor LR and RF predictions yield p lt 005 for all the quantilelabels (not just the top quantiles)

B Predictions of Multi-day Cumulative EPA

Considering a selective success of SMA based predictorsfor quantiles of the next-day EPA we train predictors forcumulative EPA ie EPA volumes over 3 and 7 days Weexpect that multi-day cumulative statistics are less volatileand therefore the quantile levels may be more stable andpredictable over time

Our experiments show that RF predictors for 3-day and7-day cumulative EPA are consistently performing at thesimilar order of magnitude for a given quantile level That isillustrated in Table V showing the precision statistics of RFpredictors for 1-day 3-day and 7-day cumulative volumes ofOrder Clickthrough and Search for 3 quantiles (3-q) Similarobservations are made for 5 quantiles and for the LR classifier

We conclude that our high performing predictors for cu-mulative multi-day EPA volumes can be flexibly applied todifferent scenarios that benefit from observing cumulativeEPA across multiple days This includes the precision of topquantile volumes for Orders Figure 7 shows 3-q predictions

TABLE V Three quantile prediction results for five categories of products across 33 vendors based on Random Forestreporting average precision by categories of 1-day (1D) 3-day (3D) and 7-day (7D) cumulative Orders Clickthrough and

Search for each quantile

Vendor 0-33 33-66 66-1001D 3D 7D 1D 3D 7D 1D 3D 7D

Order

PE 0232 0225 0237 0327 0329 0303 0522 0536 0551Sports 0241 0212 0172 0237 0240 0298 0584 0598 0607Food 0141 0139 0133 0288 0230 0251 0663 0656 0675Clothes 0287 0286 0280 0297 0272 0310 0397 0431 0478Home 0261 0266 0244 0375 0383 0351 0453 0467 0464AVG 0226 0222 0209 0310 0293 0302 0528 0540 0556

Clickthrough

PE 0488 0480 0471 0269 0303 0232 0385 0362 0372Sports 0637 0620 0623 0226 0263 0214 0232 0223 0266Food 0464 0471 0476 0282 0285 0286 0263 0266 0285Clothes 0418 0399 0423 0301 0311 0293 0331 0328 0329Home 0475 0487 0484 0327 0307 0289 0285 0278 0256AVG 0488 0486 0490 0288 0295 0271 0291 0286 0293

Search

PE 0597 0621 0655 0182 0194 0150 0277 0259 0289Sports 0621 0613 0616 0266 0248 0263 0166 0157 0152Food 0442 0413 0440 0316 0319 0292 0299 0291 0276Clothes 0455 0479 0476 0375 0370 0355 0225 0205 0172Home 0453 0474 0524 0360 0313 0298 0330 0327 0326AVG 0493 0497 0522 0315 0301 0283 0271 0261 0254

(a) Precision of bottom quantile predictionsfor Orders

(b) Precision of middle quantile predictionsfor Orders

(c) Precision of top quantile prediction forOrders

Fig 7 Precision of Random Forest predictions into three quantiles for 1-day 3-day and 7-day cumulative Orders

of 1-day 3-day and 7-day cumulative Orders for individualvendors

C Feature Significance

Both Random Forest and Logistic Regression allow us toassess the significance of individual feature types in terms oftheir contributions to the prediction decision For illustrationwe present the analysis of features for the RF classifier trainedfor 3-q predictions of the next-day EPA volumes For eachclassifier we calculate the average relative rank of a featureTable VI shows the top 10 contributing features of 1-day EPApredictors for Orders Clickthrough and Search respectivelyacross all the vendors and 3 quantiles

We observe that the relative feature importance is to adegree in agreement with the observations from the corre-lation analysis in Section V-2 Search levels are predictedby Comment and Repost activities considered over differentlengths of time 1 3 5 and 7 days Clickthrough activitiesseem to be aligned with SMA over 7 and 3 day periodsprimarily with consumer comments Orders however areclearly related to Comment volumes over a longer periodsie 5 and 7 days Overall we recognize the importance of

features generated from Comment activities as they contributeto the predictors of all EPA types

TABLE VI Ten top-ranked features of the 3-q RandomForest predictors for the next-day EPA

Rank Order Clickthrough Search1 7DCommentmin 7DCommentmin 7DCommentmin

2 7DCommentsum 3DCommentmean 5DRepostmin

3 7DCommentmean 3DCommentsum PreviousDComment4 7DCommentmax PreviousDComment 7DCommentstd5 5DCommentmean 3DCommentmax 3DCommentmean

6 5DCommentmax 7DCommentsum 7DRepoststd7 5DCommentsum 7DCommentmean 5DRepostsum8 7DCommentstd 7DPostsum 3DCommentsum9 7DCommentvar 3DCommentmin 7DCommentvar10 3DCommentmean 5DCommentvar 5DCommentvar

VI CONCLUDING REMARKS

This paper presents a detailed empirical study of the re-lationship between SMA of product vendors and EPA ofconsumers interested in the vendorsrsquo products The studyis the first to characterize the correlation of specific SMAengagements ie posts reposts and comments and EPA typesthat correspond to specific stages in the consumersrsquo purchasedecisions ie search for brands and products clickthrough

product information and product orders Our analyses un-covered low-to-moderate correlations between the volumes ofSMA-EPA pairs suggesting that predicting daily volumes ofEPA based on SMA volumes alone is not well supported How-ever moderate correlations of Post-Search Post-Clickthroughand Comment-Order across vendors suggest that one may beable to train predictors of EPA distributions and their changesrather that the precise daily volumes We thus introduce a newapproach of characterizing SMA-EPA relationship in terms ofEPA quantiles We formulate EPA predictions as a multi-classcategorization problem into 2 3 and 5 quantiles Our RandomForest classifiers outperform both the random predictors andthe Logistic Regression classifiers for top quantiles of Orderand bottom quantiles of Search and Clickthrough Our studyprovides unique insights into the varied correlations of SMA-EPA pairs and a mixed success of SMA-based predictorsin determining EPA levels A general view that social me-dia engagements undoubtedly drive consumer traction on e-commerce platforms is not substantiated by the Sina Weiboand JD case study and requires further analysis In our futurework we will expand explorations of this issue with a broaderset of product categories and SMA analyses that considerproperties of the content exchanged through the SMA

VII ACKNOWLEDGEMENTS

We wish to express our gratitude to JDcom for providingaccess to invaluable data and Bin Xu and Hao Dong for theirassistance We also acknowledge the financial support from theInternational Doctoral Innovation Centre Ningbo EducationBureau Ningbo Science and Technology Bureau and theUniversity of Nottingham This work was also supported bythe UK Engineering and Physical Sciences Research Council[grant number EPL0154631]

REFERENCES

[1] J Yeo S Kim E Koh S-w Hwang and N Lipka ldquoPredicting onlinepurchase conversion for retargetingrdquo in Proceedings of the 10th ACMInternational Conference on Web Search and Data Mining pp 591ndash600ACM 2017

[2] C Lo D Frankowski and J Leskovec ldquoUnderstanding behaviors thatlead to purchasing A case study of pinterestrdquo in Proceedings of the22th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining pp 531ndash540 2016

[3] W W Moe ldquoBuying searching or browsing Differentiating betweenonline shoppers using in-store navigational clickstreamrdquo Journal ofConsumer Psychology vol 13 no 1-2 pp 29ndash39 2003

[4] C Sismeiro and R E Bucklin ldquoModeling purchase behavior at an e-commerce web site A task-completion approachrdquo Journal of MarketingResearch vol 41 no 3 pp 306ndash323 2004

[5] C Park D Kim J Oh and H Yu ldquoPredicting user purchase in e-commerce by comprehensive feature engineering and decision boundaryfocused under-samplingrdquo in Proceedings of the 2015 International ACMRecommender Systems Challenge p 8 ACM 2015

[6] D Van den Poel and W Buckinx ldquoPredicting online-purchasing be-haviourrdquo European Journal of Operational Research vol 166 no 2pp 557ndash575 2005

[7] D J Bertsimas A J Mersereau and N R Patel ldquoDynamic classifica-tion of online customersrdquo in Proceedings of the 2003 SIAM InternationalConference on Data Mining pp 107ndash118 SIAM 2003

[8] K L Kotler Philip Keller Marketing Management volume 14 PrenticeHall Englewood Cliffs 2015

[9] L Xu J A Duan and A Whinston ldquoPath to purchase A mutuallyexciting point process model for online advertising and conversionrdquoManagement Science vol 60 no 6 pp 1392ndash1412 2014

[10] X Yu Y Liu X Huang and A An ldquoMining online reviews forpredicting sales performance A case study in the movie domainrdquoIEEE Transactions on Knowledge and Data Engineering vol 24 no 4pp 720ndash734 2012

[11] E Young Kim and Y-K Kim ldquoPredicting online purchase intentionsfor clothing productsrdquo European Journal of Marketing vol 38 no 7pp 883ndash897 2004

[12] D Hawkins D Mothersbaugh and R Best ldquoConsumer behaviourBuilding marketing strategy mcgraw hillrdquo 2012

[13] G Kulkarni P Kannan and W Moe ldquoUsing online search data toforecast new product salesrdquo Decision Support Systems vol 52 no 3pp 604ndash611 2012

[14] J Wang W X Zhao H Wei H Yan and X Li ldquoMining new businessopportunities Identifying trend related products by leveraging commer-cial intents from microblogsrdquo in Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Processing pp 1337ndash13472013

[15] B Hollerit M Kroll and M Strohmaier ldquoTowards linking buyers andsellers detecting commercial intent on twitterrdquo in Proceedings of the22nd International Conference on World Wide Web pp 629ndash632 ACM2013

[16] F Kooti K Lerman L M Aiello M Grbovic N Djuric and V Ra-dosavljevic ldquoPortrait of an online shopper Understanding and predictingconsumer behaviorrdquo in Proceedings of the 9th ACM InternationalConference on Web Search and Data Mining pp 205ndash214 ACM 2016

[17] J Leskovec L A Adamic and B A Huberman ldquoThe dynamics ofviral marketingrdquo ACM Transactions on The Web (TWEB) vol 1 no 1p 5 2007

[18] A Anagnostopoulos R Kumar and M Mahdian ldquoInfluence and cor-relation in social networksrdquo in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 7ndash15 ACM 2008

[19] D Crandall D Cosley D Huttenlocher J Kleinberg and S SurildquoFeedback effects between similarity and social influence in onlinecommunitiesrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining pp 160ndash168ACM 2008

[20] C R Shalizi and A C Thomas ldquoHomophily and contagion are generi-cally confounded in observational social network studiesrdquo SociologicalMethods amp Research vol 40 no 2 pp 211ndash239 2011

[21] V Gupta D Varshney H Jhamtani D Kedia and S Karwa ldquoIdenti-fying purchase intent from social postsrdquo in ICWSM 2014

[22] Y Zhang and M Pennacchiotti ldquoPredicting purchase behaviors fromsocial mediardquo in Proceedings of the 22nd International Conference onWorld Wide Web pp 1521ndash1532 ACM 2013

[23] X W Zhao Y Guo Y He H Jiang Y Wu and X Li ldquoWe knowwhat you want to buy a demographic-based system for product recom-mendation on microblogsrdquo in Proceedings of the 20th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 1935ndash1944 ACM 2014

[24] R Crane and D Sornette ldquoRobust dynamic classes revealed by mea-suring the response function of a social systemrdquo Proceedings of theNational Academy of Sciences vol 105 no 41 pp 15649ndash15653 2008

[25] J Lehmann B Goncalves J J Ramasco and C Cattuto ldquoDynamicalclasses of collective attention in twitterrdquo in Proceedings of the 21stInternational Conference on World Wide Web pp 251ndash260 ACM 2012

[26] D M Romero B Meeder and J Kleinberg ldquoDifferences in the mechan-ics of information diffusion across topics idioms political hashtags andcomplex contagion on twitterrdquo in Proceedings of the 20th InternationalConference on World Wide Web pp 695ndash704 ACM 2011

[27] J Yang and J Leskovec ldquoPatterns of temporal variation in onlinemediardquo in Proceedings of the 4th ACM International Conference onWeb Search and Data Mining pp 177ndash186 ACM 2011

[28] M G Akritas S A Murphy and M P Lavalley ldquoThe theil-senestimator with doubly censored data and applications to astronomyrdquoJournal of the American Statistical Association vol 90 no 429pp 170ndash177 1995

[29] D W Hosmer Jr S Lemeshow and R X Sturdivant Applied logisticregression vol 398 John Wiley amp Sons 2013

[30] H-F Yu F-L Huang and C-J Lin ldquoDual coordinate descent methodsfor logistic regression and maximum entropy modelsrdquo Machine Learn-ing vol 85 no 1 pp 41ndash75 2011

[31] A Liaw M Wiener et al ldquoClassification and regression by random-forestrdquo R News vol 2 no 3 pp 18ndash22 2002

Page 4: Social Media Brand Engagement as a Proxy for E-commerce ...eprints.nottingham.ac.uk/55994/1/Social Media Brand... · of tools and services for marketeers to manage their Social Media

(a) Pearson correlations between SMA onday tp and Search on tp+1

(b) Pearson correlations between SMA onday tp and Clickthrough on tp+1

(c) Pearson correlations between SMA onday tp and Order on tp+1

Fig 4 Pearson Correlation between social media activities (Post Repost and Comment) and specific e-commerce platformactivities (Search Clickthrough and Order) for 33 vendors using their two year time series Heat maps show low to medium

correlations between SMA and EPA pairs

TABLE II The highest correlation coefficients acrossSMA-EPA pairs for the sample of 33 vendors

Post Comment RepostClickthrough 039 023 011Search 056 020 023Order 025 021 008

a given SMA day statistics Comparing the relative importanceof SMA types it appears that the Post statistics are morehighly correlated with Search and Clickthrough driving thebrand awareness and product interest Considering Commentstatistics across vendors they are more highly correlatedwith Clickthrough and Orders than with Search suggestingthat comments on posts and reposts may help with purchasedecisions Our analysis is the first to offer empirical evidencethat different types of social media engagements relate todifferent aspects of online shopping activities

We further consider SMA-EPA correlations for differenttime frames and time lags

1) Rolling Correlation Analysis We calculate the PearsonCorrelation Coefficient for SME-EPA pairs using the rolling30-day window over the period of two years (instead of thesingle two-year time-series) and show that the correlationcoefficients for daily statistics varies over time This is illus-trated in Figure 1 (Section I) showing time variations of thecorrelation coefficients for the vendor with the highest volumeof comments Without stable correlations it would be hard forexample to use Comment features to predict Orders

2) SMA-EPA Lag Analysis In our scenario SMA andEPA occur on different platforms ie Sina Weibo and JDrespectively Thus even if SMA has an effect on EPA onecan expect delays depending on the speed of social mediapropagation For that reason we investigate rolling SMA-EPAcorrelations with different day lags allowing for 1-15 daydelay By repeating the 30-day rolling window calculationswith 1-15 day shift in time series we observe a weekly patternin the rolling correlation with a spike every 7 days Howevereven for the vendor with the highest volume of Comment

activities the absolute correlation values are always below05 Generally the volumes of Post activities have the highestcorrelation with the volumes of Search or Clickthrough whileComment data has the highest correlation with Order

C Problem Formulation

Concluding from the presented analysis the task of pre-dicting daily volumes of e-commerce activities from levels ofsocial media activities is not well supported by the correla-tion analysis of daily statistics However in many practicalscenarios it is sufficient to predict a trend ie a change inthe volume range anticipating highs and lows that may affectproduct supplies and platform logistics Furthermore insteadof daily activities it is sufficient to predict purchase outcomesover a given period of time eg a cumulative sales for 3-7ahead Thus in Section IV we formulate the EPA predictionproblem by considering (a) discretized distributions of EPAvolumes for each vendor into 2 3 and 5 quantiles and (b)quantile predictions of EPA totals for the next day next 3days and next 7 days

IV PREDICTING E-COMMERCE ACTIVITIES

Let us now consider a set of vendors represented by theirbrands B = b1 b2 bi Without a loss of generalitywe can assume that each vendor is represented by a singlebrand that has a specific daily stream of social media signalsSt = s1 s2 sj consisting of official Post activitiesby the vendor and the Repost and Comment activities by theusers of the social media platform Similarly each brand hasa daily stream of e-commerce platform activities (EPA) Et =e1 e2 el corresponding to Search Clickthrough andOrder actions on a given day t Some of the social mediausers may also be customers of the e-commerce platform butwe do not attempt to align the user activities across the twoplatforms

We consider the predictive power of the social mediasignals regarding specific e-commerce platform activities iefv(bi t St Et) a discrete volume function of a brand bi

a timestamp t a social media stream St and a type of e-commerce activity stream Et We are in fact interested inaggregated volumes of specific EPA types and consider a timeframe T = [tp tp+h] where the date tp is the day of predictionand tp+h is the prediction horizon h = 1 3 7 (1-day 3-day7-day ahead) Therefore we aim to learn a proxy function fvthat at tp defines a cumulative function

FV (bi T SE) =

t=tp+h983131

t=tp

fv(bi t St Et)

integrating ie summing up fv over T

A Multi-class Predictions

Instead of predicting specific values of FV (bi S T Et) wepredict the distribution of FV in terms of quantile levels thatFV may attain on a given day or a given period of timeEvaluating the performance of quantile predictors enables usto assess whether different social media signals are predictiveof specific EPA levels We cast that as a multi-class categoriza-tion problem using supervised learning where the number ofquantiles q determines the number of classes Given a numberof quantiles q and a training set with values FV (bi S T Et)from the training time period we determine the data pointsFV (bi S T Et) within each quantile ie

Pr(FV (bi S T Et) le fk) = kq forallk isin [1 q minus 1]

and the corresponding ranges of FV values The mini-mummaximum values determine the thresholds for assigninga class label to an instance of FV

Training of the multi-class predictor is based on partitioningthe training set into distribution quantiles with correspondingclass labels and value ranges We use the quantile rangesto assign class labels to FV values of the test set If thetest distribution is broader than the one of the training setFV values below the minimum or above the maximum areassigned to the lowesthighest quantile respectively

Given a time of prediction tp we extract features fromthe social media stream S up to tp and predict FV for theprediction horizon tp+h We consider activities on a daily basisand use tp as 235959 on each day

B Feature Construction

Our feature set is based on the three types of SMA signalsPost Repost and Comment activities For each signal type wegenerate features by calculating statistics from SMA streamsfor the time spans of a pre-specified length In particular weconsider K-day periods with Kisin1 3 5 7 Table III showsthe statistics that we calculate for each signal type ie PostRepost Comment over the specific stream length K TheTheil-Sen estimator [28] is a non-parametric trend detectorthat corresponds to the median over all possible combinationsof slopes over K daily measures of the SMA activity In totalwe construct 22 features for each of the 3 SMA types thetotal of 66 features that characterize a vendorrsquos SMA

TABLE III Description of statistics used as features for eachSMA type over K days before tp For K = 1 we consider

only Featsum

Feature DescriptionFeatsum Sum of activity volume over K-day periodFeatmean Mean activity volume over K-day periodFeatmax Maximum activity volume over K-day periodFeatmin Maximum activity volume over K-day periodFeatvar Variance of activity volume over K-day periodFeatstdev Standard deviation of activity volume over K-day pe-

riodFeatprev Activity volume on previous dayFeattheil Theil-Sen estimator

V EXPERIMENTS

We use Logistic Regression and Random Forest [29]ndash[31] tolearn multi-class classification models for 2 3 and 5 quantiles(2-q 3-q 5-q) and vary the prediction horizon by modelingnext day (1-day) 3-day and 7-day cumulative volumes ofSearch Clickthrough and Orders We evaluate multi-class pre-diction results by calculating standard precision recall and F1statistics However we focus our discussion on the precisionsince the aim is to assess the effectiveness of SMA featuresin predicting volume levels of specific EPA types Thusidentifying true-positive instances within a specific quantileis given a priority over avoiding false-negative instances Infact correct predictions of high (top 20-30) and low(bottom 20-30) quantiles are of particular interest sincetheir detection can improve SMA campaigns and optimize e-commerce operations for high levels of activities that on somedays such as global shopping events can increase 100-fold

Fig 5 Sliding window of a 12-month training and 1-monthtest data with a shift of one calendar month to cover the test

period from Jan 2017 until Dec 2017

1) Temporal Cross-validation All our experiments areperformed using 12-fold time-series cross-validation as shownin Figure 5 with data sets specific to each vendor We trainand test our models using a two year JD data set (SectionIII-A) Our starting training set covers the 12 month periodfrom 1 January 2016 to 31 December 2016 The test set isthe following month In each fold we slide the 12-monthtraining and 1-month test period by one calendar month Thusfor each experiment we use one year of historical data andpredict EPA in every calendar month in 2017 For each vendorwe report the average precision statistics across 12-folds Wecollate and average precision statistics across 33 vendors andacross vendor categories

2) Activity Predictions For a given quantile scale (2-q 3-q5-q) we predict volumes of activities for the next day tp+1 (1-

(a) Precision statistics for 3-q next-day predictions for Orders (b) Precision statistics for 5-q next-day predictions for Orders

Fig 6 Precision of Random Forest predictors for 3-q and 5-q categories for the next-day Orders across 33 vendors The topquantile precisions are higher than random predictions (33 for 3-q and 20 for 5-q)

TABLE IV Precision statistics for Random Forest 5-q classifiers for the next-day volumes of each EPA type OrderClickthrough and Search Results are aggregated per five categories

Vendor 0-20 20-40 40-60 60-80 80-100AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN

Order

PampE 0159 0338 0013 0136 0218 0054 0198 0292 0070 0298 0570 0149 0341 0393 0271Sports 0129 0340 0000 0161 0370 0045 0154 0263 0021 0247 0425 0149 0456 0653 0154Food 0070 0231 0000 0090 0302 0000 0144 0271 0030 0313 0462 0181 0410 0708 0213Clothes 0199 0571 0000 0129 0217 0011 0186 0385 0065 0225 0464 0018 0308 0528 0076Home 0167 0596 0000 0157 0317 0000 0263 0517 0092 0222 0333 0091 0312 0481 0079

Clickthrough

PampE 0371 0548 0192 0200 0311 0099 0147 0241 0025 0159 0353 0000 0265 0565 0055Sports 0462 0631 0275 0264 0393 0094 0144 0328 0038 0135 0348 0024 0148 0286 0037Food 0351 0860 0083 0223 0488 0056 0177 0345 0030 0136 0290 0000 0162 0385 0036Clothes 0312 0571 0013 0165 0260 0030 0199 0426 0012 0145 0308 0000 0192 0500 0023Home 0324 0645 0100 0225 0347 0116 0188 0385 0025 0149 0259 0039 0174 0365 0000

Search

PampE 0521 0679 0336 0231 0353 0065 0109 0293 0029 0077 0176 0000 0255 0481 0070Sports 0483 0763 0276 0248 0350 0103 0130 0222 0000 0081 0173 0000 0126 0250 0066Food 0308 0784 0089 0193 0370 0027 0183 0324 0027 0175 0373 0052 0185 0317 0048Clothes 0329 0628 0194 0251 0353 0114 0232 0313 0131 0127 0238 0020 0119 0288 0043Home 0351 0573 0179 0232 0323 0041 0200 0322 0083 0200 0311 0098 0194 0316 0086

day) and the multi-day cumulative activities over 3-day and7-day periods For each prediction type and each individualvendor we use the corresponding quantile scale determinedover the training data Our experiments thus involve 891predictions for each of 33 vendors 3 EPA types (SearchClickthrough Orders) with three quantile scales (2-q 3-qand 5-q) and 3 time periods (1-day 3-day 7-day) For thisdiscussion we select experiments that shed light on

bull How successful the predictors are in identifying quantilesfor individual EPA types across the vendor sample

bull How well the predictors perform for the cumulative 3-dayand 7-day activities across the vendor sample

bull Which features significantly contribute to the perfor-mance of predictors for the individual EPA type

A Next-day EPA Predictions for 3 and 5 QuantilesExperiments with the next day predictions of EPA for 3-q

and 5-q show that the Random Forest (RF) predictors performhigher than random for Orders in the top quantiles (top 33and 20 respectively) Figure 6 presents RF precision forOrder quantiles for individual vendors Table IV summarizesRF precision statistics for all three EPA types and the fivevendor categories We highlight the RF results that are betterthan random predictions (gt 020 for 5-q) We see thatin addition to the top quantile predictions for Orders theprecision statistics are better than random for Search andClickthrough in the lowest quantile (bottom 33 for 3-q and

bottom 20 for 5-q) All other quantile predictions for OrderClickthrough and Search are on par or lower than random

We conducted the same experiments with Logistic Regres-sion (LR) and found that RF outperforms LR predictors Infact for our sample of product vendors the t-test performedfor LR and RF predictions yield p lt 005 for all the quantilelabels (not just the top quantiles)

B Predictions of Multi-day Cumulative EPA

Considering a selective success of SMA based predictorsfor quantiles of the next-day EPA we train predictors forcumulative EPA ie EPA volumes over 3 and 7 days Weexpect that multi-day cumulative statistics are less volatileand therefore the quantile levels may be more stable andpredictable over time

Our experiments show that RF predictors for 3-day and7-day cumulative EPA are consistently performing at thesimilar order of magnitude for a given quantile level That isillustrated in Table V showing the precision statistics of RFpredictors for 1-day 3-day and 7-day cumulative volumes ofOrder Clickthrough and Search for 3 quantiles (3-q) Similarobservations are made for 5 quantiles and for the LR classifier

We conclude that our high performing predictors for cu-mulative multi-day EPA volumes can be flexibly applied todifferent scenarios that benefit from observing cumulativeEPA across multiple days This includes the precision of topquantile volumes for Orders Figure 7 shows 3-q predictions

TABLE V Three quantile prediction results for five categories of products across 33 vendors based on Random Forestreporting average precision by categories of 1-day (1D) 3-day (3D) and 7-day (7D) cumulative Orders Clickthrough and

Search for each quantile

Vendor 0-33 33-66 66-1001D 3D 7D 1D 3D 7D 1D 3D 7D

Order

PE 0232 0225 0237 0327 0329 0303 0522 0536 0551Sports 0241 0212 0172 0237 0240 0298 0584 0598 0607Food 0141 0139 0133 0288 0230 0251 0663 0656 0675Clothes 0287 0286 0280 0297 0272 0310 0397 0431 0478Home 0261 0266 0244 0375 0383 0351 0453 0467 0464AVG 0226 0222 0209 0310 0293 0302 0528 0540 0556

Clickthrough

PE 0488 0480 0471 0269 0303 0232 0385 0362 0372Sports 0637 0620 0623 0226 0263 0214 0232 0223 0266Food 0464 0471 0476 0282 0285 0286 0263 0266 0285Clothes 0418 0399 0423 0301 0311 0293 0331 0328 0329Home 0475 0487 0484 0327 0307 0289 0285 0278 0256AVG 0488 0486 0490 0288 0295 0271 0291 0286 0293

Search

PE 0597 0621 0655 0182 0194 0150 0277 0259 0289Sports 0621 0613 0616 0266 0248 0263 0166 0157 0152Food 0442 0413 0440 0316 0319 0292 0299 0291 0276Clothes 0455 0479 0476 0375 0370 0355 0225 0205 0172Home 0453 0474 0524 0360 0313 0298 0330 0327 0326AVG 0493 0497 0522 0315 0301 0283 0271 0261 0254

(a) Precision of bottom quantile predictionsfor Orders

(b) Precision of middle quantile predictionsfor Orders

(c) Precision of top quantile prediction forOrders

Fig 7 Precision of Random Forest predictions into three quantiles for 1-day 3-day and 7-day cumulative Orders

of 1-day 3-day and 7-day cumulative Orders for individualvendors

C Feature Significance

Both Random Forest and Logistic Regression allow us toassess the significance of individual feature types in terms oftheir contributions to the prediction decision For illustrationwe present the analysis of features for the RF classifier trainedfor 3-q predictions of the next-day EPA volumes For eachclassifier we calculate the average relative rank of a featureTable VI shows the top 10 contributing features of 1-day EPApredictors for Orders Clickthrough and Search respectivelyacross all the vendors and 3 quantiles

We observe that the relative feature importance is to adegree in agreement with the observations from the corre-lation analysis in Section V-2 Search levels are predictedby Comment and Repost activities considered over differentlengths of time 1 3 5 and 7 days Clickthrough activitiesseem to be aligned with SMA over 7 and 3 day periodsprimarily with consumer comments Orders however areclearly related to Comment volumes over a longer periodsie 5 and 7 days Overall we recognize the importance of

features generated from Comment activities as they contributeto the predictors of all EPA types

TABLE VI Ten top-ranked features of the 3-q RandomForest predictors for the next-day EPA

Rank Order Clickthrough Search1 7DCommentmin 7DCommentmin 7DCommentmin

2 7DCommentsum 3DCommentmean 5DRepostmin

3 7DCommentmean 3DCommentsum PreviousDComment4 7DCommentmax PreviousDComment 7DCommentstd5 5DCommentmean 3DCommentmax 3DCommentmean

6 5DCommentmax 7DCommentsum 7DRepoststd7 5DCommentsum 7DCommentmean 5DRepostsum8 7DCommentstd 7DPostsum 3DCommentsum9 7DCommentvar 3DCommentmin 7DCommentvar10 3DCommentmean 5DCommentvar 5DCommentvar

VI CONCLUDING REMARKS

This paper presents a detailed empirical study of the re-lationship between SMA of product vendors and EPA ofconsumers interested in the vendorsrsquo products The studyis the first to characterize the correlation of specific SMAengagements ie posts reposts and comments and EPA typesthat correspond to specific stages in the consumersrsquo purchasedecisions ie search for brands and products clickthrough

product information and product orders Our analyses un-covered low-to-moderate correlations between the volumes ofSMA-EPA pairs suggesting that predicting daily volumes ofEPA based on SMA volumes alone is not well supported How-ever moderate correlations of Post-Search Post-Clickthroughand Comment-Order across vendors suggest that one may beable to train predictors of EPA distributions and their changesrather that the precise daily volumes We thus introduce a newapproach of characterizing SMA-EPA relationship in terms ofEPA quantiles We formulate EPA predictions as a multi-classcategorization problem into 2 3 and 5 quantiles Our RandomForest classifiers outperform both the random predictors andthe Logistic Regression classifiers for top quantiles of Orderand bottom quantiles of Search and Clickthrough Our studyprovides unique insights into the varied correlations of SMA-EPA pairs and a mixed success of SMA-based predictorsin determining EPA levels A general view that social me-dia engagements undoubtedly drive consumer traction on e-commerce platforms is not substantiated by the Sina Weiboand JD case study and requires further analysis In our futurework we will expand explorations of this issue with a broaderset of product categories and SMA analyses that considerproperties of the content exchanged through the SMA

VII ACKNOWLEDGEMENTS

We wish to express our gratitude to JDcom for providingaccess to invaluable data and Bin Xu and Hao Dong for theirassistance We also acknowledge the financial support from theInternational Doctoral Innovation Centre Ningbo EducationBureau Ningbo Science and Technology Bureau and theUniversity of Nottingham This work was also supported bythe UK Engineering and Physical Sciences Research Council[grant number EPL0154631]

REFERENCES

[1] J Yeo S Kim E Koh S-w Hwang and N Lipka ldquoPredicting onlinepurchase conversion for retargetingrdquo in Proceedings of the 10th ACMInternational Conference on Web Search and Data Mining pp 591ndash600ACM 2017

[2] C Lo D Frankowski and J Leskovec ldquoUnderstanding behaviors thatlead to purchasing A case study of pinterestrdquo in Proceedings of the22th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining pp 531ndash540 2016

[3] W W Moe ldquoBuying searching or browsing Differentiating betweenonline shoppers using in-store navigational clickstreamrdquo Journal ofConsumer Psychology vol 13 no 1-2 pp 29ndash39 2003

[4] C Sismeiro and R E Bucklin ldquoModeling purchase behavior at an e-commerce web site A task-completion approachrdquo Journal of MarketingResearch vol 41 no 3 pp 306ndash323 2004

[5] C Park D Kim J Oh and H Yu ldquoPredicting user purchase in e-commerce by comprehensive feature engineering and decision boundaryfocused under-samplingrdquo in Proceedings of the 2015 International ACMRecommender Systems Challenge p 8 ACM 2015

[6] D Van den Poel and W Buckinx ldquoPredicting online-purchasing be-haviourrdquo European Journal of Operational Research vol 166 no 2pp 557ndash575 2005

[7] D J Bertsimas A J Mersereau and N R Patel ldquoDynamic classifica-tion of online customersrdquo in Proceedings of the 2003 SIAM InternationalConference on Data Mining pp 107ndash118 SIAM 2003

[8] K L Kotler Philip Keller Marketing Management volume 14 PrenticeHall Englewood Cliffs 2015

[9] L Xu J A Duan and A Whinston ldquoPath to purchase A mutuallyexciting point process model for online advertising and conversionrdquoManagement Science vol 60 no 6 pp 1392ndash1412 2014

[10] X Yu Y Liu X Huang and A An ldquoMining online reviews forpredicting sales performance A case study in the movie domainrdquoIEEE Transactions on Knowledge and Data Engineering vol 24 no 4pp 720ndash734 2012

[11] E Young Kim and Y-K Kim ldquoPredicting online purchase intentionsfor clothing productsrdquo European Journal of Marketing vol 38 no 7pp 883ndash897 2004

[12] D Hawkins D Mothersbaugh and R Best ldquoConsumer behaviourBuilding marketing strategy mcgraw hillrdquo 2012

[13] G Kulkarni P Kannan and W Moe ldquoUsing online search data toforecast new product salesrdquo Decision Support Systems vol 52 no 3pp 604ndash611 2012

[14] J Wang W X Zhao H Wei H Yan and X Li ldquoMining new businessopportunities Identifying trend related products by leveraging commer-cial intents from microblogsrdquo in Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Processing pp 1337ndash13472013

[15] B Hollerit M Kroll and M Strohmaier ldquoTowards linking buyers andsellers detecting commercial intent on twitterrdquo in Proceedings of the22nd International Conference on World Wide Web pp 629ndash632 ACM2013

[16] F Kooti K Lerman L M Aiello M Grbovic N Djuric and V Ra-dosavljevic ldquoPortrait of an online shopper Understanding and predictingconsumer behaviorrdquo in Proceedings of the 9th ACM InternationalConference on Web Search and Data Mining pp 205ndash214 ACM 2016

[17] J Leskovec L A Adamic and B A Huberman ldquoThe dynamics ofviral marketingrdquo ACM Transactions on The Web (TWEB) vol 1 no 1p 5 2007

[18] A Anagnostopoulos R Kumar and M Mahdian ldquoInfluence and cor-relation in social networksrdquo in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 7ndash15 ACM 2008

[19] D Crandall D Cosley D Huttenlocher J Kleinberg and S SurildquoFeedback effects between similarity and social influence in onlinecommunitiesrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining pp 160ndash168ACM 2008

[20] C R Shalizi and A C Thomas ldquoHomophily and contagion are generi-cally confounded in observational social network studiesrdquo SociologicalMethods amp Research vol 40 no 2 pp 211ndash239 2011

[21] V Gupta D Varshney H Jhamtani D Kedia and S Karwa ldquoIdenti-fying purchase intent from social postsrdquo in ICWSM 2014

[22] Y Zhang and M Pennacchiotti ldquoPredicting purchase behaviors fromsocial mediardquo in Proceedings of the 22nd International Conference onWorld Wide Web pp 1521ndash1532 ACM 2013

[23] X W Zhao Y Guo Y He H Jiang Y Wu and X Li ldquoWe knowwhat you want to buy a demographic-based system for product recom-mendation on microblogsrdquo in Proceedings of the 20th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 1935ndash1944 ACM 2014

[24] R Crane and D Sornette ldquoRobust dynamic classes revealed by mea-suring the response function of a social systemrdquo Proceedings of theNational Academy of Sciences vol 105 no 41 pp 15649ndash15653 2008

[25] J Lehmann B Goncalves J J Ramasco and C Cattuto ldquoDynamicalclasses of collective attention in twitterrdquo in Proceedings of the 21stInternational Conference on World Wide Web pp 251ndash260 ACM 2012

[26] D M Romero B Meeder and J Kleinberg ldquoDifferences in the mechan-ics of information diffusion across topics idioms political hashtags andcomplex contagion on twitterrdquo in Proceedings of the 20th InternationalConference on World Wide Web pp 695ndash704 ACM 2011

[27] J Yang and J Leskovec ldquoPatterns of temporal variation in onlinemediardquo in Proceedings of the 4th ACM International Conference onWeb Search and Data Mining pp 177ndash186 ACM 2011

[28] M G Akritas S A Murphy and M P Lavalley ldquoThe theil-senestimator with doubly censored data and applications to astronomyrdquoJournal of the American Statistical Association vol 90 no 429pp 170ndash177 1995

[29] D W Hosmer Jr S Lemeshow and R X Sturdivant Applied logisticregression vol 398 John Wiley amp Sons 2013

[30] H-F Yu F-L Huang and C-J Lin ldquoDual coordinate descent methodsfor logistic regression and maximum entropy modelsrdquo Machine Learn-ing vol 85 no 1 pp 41ndash75 2011

[31] A Liaw M Wiener et al ldquoClassification and regression by random-forestrdquo R News vol 2 no 3 pp 18ndash22 2002

Page 5: Social Media Brand Engagement as a Proxy for E-commerce ...eprints.nottingham.ac.uk/55994/1/Social Media Brand... · of tools and services for marketeers to manage their Social Media

a timestamp t a social media stream St and a type of e-commerce activity stream Et We are in fact interested inaggregated volumes of specific EPA types and consider a timeframe T = [tp tp+h] where the date tp is the day of predictionand tp+h is the prediction horizon h = 1 3 7 (1-day 3-day7-day ahead) Therefore we aim to learn a proxy function fvthat at tp defines a cumulative function

FV (bi T SE) =

t=tp+h983131

t=tp

fv(bi t St Et)

integrating ie summing up fv over T

A Multi-class Predictions

Instead of predicting specific values of FV (bi S T Et) wepredict the distribution of FV in terms of quantile levels thatFV may attain on a given day or a given period of timeEvaluating the performance of quantile predictors enables usto assess whether different social media signals are predictiveof specific EPA levels We cast that as a multi-class categoriza-tion problem using supervised learning where the number ofquantiles q determines the number of classes Given a numberof quantiles q and a training set with values FV (bi S T Et)from the training time period we determine the data pointsFV (bi S T Et) within each quantile ie

Pr(FV (bi S T Et) le fk) = kq forallk isin [1 q minus 1]

and the corresponding ranges of FV values The mini-mummaximum values determine the thresholds for assigninga class label to an instance of FV

Training of the multi-class predictor is based on partitioningthe training set into distribution quantiles with correspondingclass labels and value ranges We use the quantile rangesto assign class labels to FV values of the test set If thetest distribution is broader than the one of the training setFV values below the minimum or above the maximum areassigned to the lowesthighest quantile respectively

Given a time of prediction tp we extract features fromthe social media stream S up to tp and predict FV for theprediction horizon tp+h We consider activities on a daily basisand use tp as 235959 on each day

B Feature Construction

Our feature set is based on the three types of SMA signalsPost Repost and Comment activities For each signal type wegenerate features by calculating statistics from SMA streamsfor the time spans of a pre-specified length In particular weconsider K-day periods with Kisin1 3 5 7 Table III showsthe statistics that we calculate for each signal type ie PostRepost Comment over the specific stream length K TheTheil-Sen estimator [28] is a non-parametric trend detectorthat corresponds to the median over all possible combinationsof slopes over K daily measures of the SMA activity In totalwe construct 22 features for each of the 3 SMA types thetotal of 66 features that characterize a vendorrsquos SMA

TABLE III Description of statistics used as features for eachSMA type over K days before tp For K = 1 we consider

only Featsum

Feature DescriptionFeatsum Sum of activity volume over K-day periodFeatmean Mean activity volume over K-day periodFeatmax Maximum activity volume over K-day periodFeatmin Maximum activity volume over K-day periodFeatvar Variance of activity volume over K-day periodFeatstdev Standard deviation of activity volume over K-day pe-

riodFeatprev Activity volume on previous dayFeattheil Theil-Sen estimator

V EXPERIMENTS

We use Logistic Regression and Random Forest [29]ndash[31] tolearn multi-class classification models for 2 3 and 5 quantiles(2-q 3-q 5-q) and vary the prediction horizon by modelingnext day (1-day) 3-day and 7-day cumulative volumes ofSearch Clickthrough and Orders We evaluate multi-class pre-diction results by calculating standard precision recall and F1statistics However we focus our discussion on the precisionsince the aim is to assess the effectiveness of SMA featuresin predicting volume levels of specific EPA types Thusidentifying true-positive instances within a specific quantileis given a priority over avoiding false-negative instances Infact correct predictions of high (top 20-30) and low(bottom 20-30) quantiles are of particular interest sincetheir detection can improve SMA campaigns and optimize e-commerce operations for high levels of activities that on somedays such as global shopping events can increase 100-fold

Fig 5 Sliding window of a 12-month training and 1-monthtest data with a shift of one calendar month to cover the test

period from Jan 2017 until Dec 2017

1) Temporal Cross-validation All our experiments areperformed using 12-fold time-series cross-validation as shownin Figure 5 with data sets specific to each vendor We trainand test our models using a two year JD data set (SectionIII-A) Our starting training set covers the 12 month periodfrom 1 January 2016 to 31 December 2016 The test set isthe following month In each fold we slide the 12-monthtraining and 1-month test period by one calendar month Thusfor each experiment we use one year of historical data andpredict EPA in every calendar month in 2017 For each vendorwe report the average precision statistics across 12-folds Wecollate and average precision statistics across 33 vendors andacross vendor categories

2) Activity Predictions For a given quantile scale (2-q 3-q5-q) we predict volumes of activities for the next day tp+1 (1-

(a) Precision statistics for 3-q next-day predictions for Orders (b) Precision statistics for 5-q next-day predictions for Orders

Fig 6 Precision of Random Forest predictors for 3-q and 5-q categories for the next-day Orders across 33 vendors The topquantile precisions are higher than random predictions (33 for 3-q and 20 for 5-q)

TABLE IV Precision statistics for Random Forest 5-q classifiers for the next-day volumes of each EPA type OrderClickthrough and Search Results are aggregated per five categories

Vendor 0-20 20-40 40-60 60-80 80-100AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN

Order

PampE 0159 0338 0013 0136 0218 0054 0198 0292 0070 0298 0570 0149 0341 0393 0271Sports 0129 0340 0000 0161 0370 0045 0154 0263 0021 0247 0425 0149 0456 0653 0154Food 0070 0231 0000 0090 0302 0000 0144 0271 0030 0313 0462 0181 0410 0708 0213Clothes 0199 0571 0000 0129 0217 0011 0186 0385 0065 0225 0464 0018 0308 0528 0076Home 0167 0596 0000 0157 0317 0000 0263 0517 0092 0222 0333 0091 0312 0481 0079

Clickthrough

PampE 0371 0548 0192 0200 0311 0099 0147 0241 0025 0159 0353 0000 0265 0565 0055Sports 0462 0631 0275 0264 0393 0094 0144 0328 0038 0135 0348 0024 0148 0286 0037Food 0351 0860 0083 0223 0488 0056 0177 0345 0030 0136 0290 0000 0162 0385 0036Clothes 0312 0571 0013 0165 0260 0030 0199 0426 0012 0145 0308 0000 0192 0500 0023Home 0324 0645 0100 0225 0347 0116 0188 0385 0025 0149 0259 0039 0174 0365 0000

Search

PampE 0521 0679 0336 0231 0353 0065 0109 0293 0029 0077 0176 0000 0255 0481 0070Sports 0483 0763 0276 0248 0350 0103 0130 0222 0000 0081 0173 0000 0126 0250 0066Food 0308 0784 0089 0193 0370 0027 0183 0324 0027 0175 0373 0052 0185 0317 0048Clothes 0329 0628 0194 0251 0353 0114 0232 0313 0131 0127 0238 0020 0119 0288 0043Home 0351 0573 0179 0232 0323 0041 0200 0322 0083 0200 0311 0098 0194 0316 0086

day) and the multi-day cumulative activities over 3-day and7-day periods For each prediction type and each individualvendor we use the corresponding quantile scale determinedover the training data Our experiments thus involve 891predictions for each of 33 vendors 3 EPA types (SearchClickthrough Orders) with three quantile scales (2-q 3-qand 5-q) and 3 time periods (1-day 3-day 7-day) For thisdiscussion we select experiments that shed light on

bull How successful the predictors are in identifying quantilesfor individual EPA types across the vendor sample

bull How well the predictors perform for the cumulative 3-dayand 7-day activities across the vendor sample

bull Which features significantly contribute to the perfor-mance of predictors for the individual EPA type

A Next-day EPA Predictions for 3 and 5 QuantilesExperiments with the next day predictions of EPA for 3-q

and 5-q show that the Random Forest (RF) predictors performhigher than random for Orders in the top quantiles (top 33and 20 respectively) Figure 6 presents RF precision forOrder quantiles for individual vendors Table IV summarizesRF precision statistics for all three EPA types and the fivevendor categories We highlight the RF results that are betterthan random predictions (gt 020 for 5-q) We see thatin addition to the top quantile predictions for Orders theprecision statistics are better than random for Search andClickthrough in the lowest quantile (bottom 33 for 3-q and

bottom 20 for 5-q) All other quantile predictions for OrderClickthrough and Search are on par or lower than random

We conducted the same experiments with Logistic Regres-sion (LR) and found that RF outperforms LR predictors Infact for our sample of product vendors the t-test performedfor LR and RF predictions yield p lt 005 for all the quantilelabels (not just the top quantiles)

B Predictions of Multi-day Cumulative EPA

Considering a selective success of SMA based predictorsfor quantiles of the next-day EPA we train predictors forcumulative EPA ie EPA volumes over 3 and 7 days Weexpect that multi-day cumulative statistics are less volatileand therefore the quantile levels may be more stable andpredictable over time

Our experiments show that RF predictors for 3-day and7-day cumulative EPA are consistently performing at thesimilar order of magnitude for a given quantile level That isillustrated in Table V showing the precision statistics of RFpredictors for 1-day 3-day and 7-day cumulative volumes ofOrder Clickthrough and Search for 3 quantiles (3-q) Similarobservations are made for 5 quantiles and for the LR classifier

We conclude that our high performing predictors for cu-mulative multi-day EPA volumes can be flexibly applied todifferent scenarios that benefit from observing cumulativeEPA across multiple days This includes the precision of topquantile volumes for Orders Figure 7 shows 3-q predictions

TABLE V Three quantile prediction results for five categories of products across 33 vendors based on Random Forestreporting average precision by categories of 1-day (1D) 3-day (3D) and 7-day (7D) cumulative Orders Clickthrough and

Search for each quantile

Vendor 0-33 33-66 66-1001D 3D 7D 1D 3D 7D 1D 3D 7D

Order

PE 0232 0225 0237 0327 0329 0303 0522 0536 0551Sports 0241 0212 0172 0237 0240 0298 0584 0598 0607Food 0141 0139 0133 0288 0230 0251 0663 0656 0675Clothes 0287 0286 0280 0297 0272 0310 0397 0431 0478Home 0261 0266 0244 0375 0383 0351 0453 0467 0464AVG 0226 0222 0209 0310 0293 0302 0528 0540 0556

Clickthrough

PE 0488 0480 0471 0269 0303 0232 0385 0362 0372Sports 0637 0620 0623 0226 0263 0214 0232 0223 0266Food 0464 0471 0476 0282 0285 0286 0263 0266 0285Clothes 0418 0399 0423 0301 0311 0293 0331 0328 0329Home 0475 0487 0484 0327 0307 0289 0285 0278 0256AVG 0488 0486 0490 0288 0295 0271 0291 0286 0293

Search

PE 0597 0621 0655 0182 0194 0150 0277 0259 0289Sports 0621 0613 0616 0266 0248 0263 0166 0157 0152Food 0442 0413 0440 0316 0319 0292 0299 0291 0276Clothes 0455 0479 0476 0375 0370 0355 0225 0205 0172Home 0453 0474 0524 0360 0313 0298 0330 0327 0326AVG 0493 0497 0522 0315 0301 0283 0271 0261 0254

(a) Precision of bottom quantile predictionsfor Orders

(b) Precision of middle quantile predictionsfor Orders

(c) Precision of top quantile prediction forOrders

Fig 7 Precision of Random Forest predictions into three quantiles for 1-day 3-day and 7-day cumulative Orders

of 1-day 3-day and 7-day cumulative Orders for individualvendors

C Feature Significance

Both Random Forest and Logistic Regression allow us toassess the significance of individual feature types in terms oftheir contributions to the prediction decision For illustrationwe present the analysis of features for the RF classifier trainedfor 3-q predictions of the next-day EPA volumes For eachclassifier we calculate the average relative rank of a featureTable VI shows the top 10 contributing features of 1-day EPApredictors for Orders Clickthrough and Search respectivelyacross all the vendors and 3 quantiles

We observe that the relative feature importance is to adegree in agreement with the observations from the corre-lation analysis in Section V-2 Search levels are predictedby Comment and Repost activities considered over differentlengths of time 1 3 5 and 7 days Clickthrough activitiesseem to be aligned with SMA over 7 and 3 day periodsprimarily with consumer comments Orders however areclearly related to Comment volumes over a longer periodsie 5 and 7 days Overall we recognize the importance of

features generated from Comment activities as they contributeto the predictors of all EPA types

TABLE VI Ten top-ranked features of the 3-q RandomForest predictors for the next-day EPA

Rank Order Clickthrough Search1 7DCommentmin 7DCommentmin 7DCommentmin

2 7DCommentsum 3DCommentmean 5DRepostmin

3 7DCommentmean 3DCommentsum PreviousDComment4 7DCommentmax PreviousDComment 7DCommentstd5 5DCommentmean 3DCommentmax 3DCommentmean

6 5DCommentmax 7DCommentsum 7DRepoststd7 5DCommentsum 7DCommentmean 5DRepostsum8 7DCommentstd 7DPostsum 3DCommentsum9 7DCommentvar 3DCommentmin 7DCommentvar10 3DCommentmean 5DCommentvar 5DCommentvar

VI CONCLUDING REMARKS

This paper presents a detailed empirical study of the re-lationship between SMA of product vendors and EPA ofconsumers interested in the vendorsrsquo products The studyis the first to characterize the correlation of specific SMAengagements ie posts reposts and comments and EPA typesthat correspond to specific stages in the consumersrsquo purchasedecisions ie search for brands and products clickthrough

product information and product orders Our analyses un-covered low-to-moderate correlations between the volumes ofSMA-EPA pairs suggesting that predicting daily volumes ofEPA based on SMA volumes alone is not well supported How-ever moderate correlations of Post-Search Post-Clickthroughand Comment-Order across vendors suggest that one may beable to train predictors of EPA distributions and their changesrather that the precise daily volumes We thus introduce a newapproach of characterizing SMA-EPA relationship in terms ofEPA quantiles We formulate EPA predictions as a multi-classcategorization problem into 2 3 and 5 quantiles Our RandomForest classifiers outperform both the random predictors andthe Logistic Regression classifiers for top quantiles of Orderand bottom quantiles of Search and Clickthrough Our studyprovides unique insights into the varied correlations of SMA-EPA pairs and a mixed success of SMA-based predictorsin determining EPA levels A general view that social me-dia engagements undoubtedly drive consumer traction on e-commerce platforms is not substantiated by the Sina Weiboand JD case study and requires further analysis In our futurework we will expand explorations of this issue with a broaderset of product categories and SMA analyses that considerproperties of the content exchanged through the SMA

VII ACKNOWLEDGEMENTS

We wish to express our gratitude to JDcom for providingaccess to invaluable data and Bin Xu and Hao Dong for theirassistance We also acknowledge the financial support from theInternational Doctoral Innovation Centre Ningbo EducationBureau Ningbo Science and Technology Bureau and theUniversity of Nottingham This work was also supported bythe UK Engineering and Physical Sciences Research Council[grant number EPL0154631]

REFERENCES

[1] J Yeo S Kim E Koh S-w Hwang and N Lipka ldquoPredicting onlinepurchase conversion for retargetingrdquo in Proceedings of the 10th ACMInternational Conference on Web Search and Data Mining pp 591ndash600ACM 2017

[2] C Lo D Frankowski and J Leskovec ldquoUnderstanding behaviors thatlead to purchasing A case study of pinterestrdquo in Proceedings of the22th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining pp 531ndash540 2016

[3] W W Moe ldquoBuying searching or browsing Differentiating betweenonline shoppers using in-store navigational clickstreamrdquo Journal ofConsumer Psychology vol 13 no 1-2 pp 29ndash39 2003

[4] C Sismeiro and R E Bucklin ldquoModeling purchase behavior at an e-commerce web site A task-completion approachrdquo Journal of MarketingResearch vol 41 no 3 pp 306ndash323 2004

[5] C Park D Kim J Oh and H Yu ldquoPredicting user purchase in e-commerce by comprehensive feature engineering and decision boundaryfocused under-samplingrdquo in Proceedings of the 2015 International ACMRecommender Systems Challenge p 8 ACM 2015

[6] D Van den Poel and W Buckinx ldquoPredicting online-purchasing be-haviourrdquo European Journal of Operational Research vol 166 no 2pp 557ndash575 2005

[7] D J Bertsimas A J Mersereau and N R Patel ldquoDynamic classifica-tion of online customersrdquo in Proceedings of the 2003 SIAM InternationalConference on Data Mining pp 107ndash118 SIAM 2003

[8] K L Kotler Philip Keller Marketing Management volume 14 PrenticeHall Englewood Cliffs 2015

[9] L Xu J A Duan and A Whinston ldquoPath to purchase A mutuallyexciting point process model for online advertising and conversionrdquoManagement Science vol 60 no 6 pp 1392ndash1412 2014

[10] X Yu Y Liu X Huang and A An ldquoMining online reviews forpredicting sales performance A case study in the movie domainrdquoIEEE Transactions on Knowledge and Data Engineering vol 24 no 4pp 720ndash734 2012

[11] E Young Kim and Y-K Kim ldquoPredicting online purchase intentionsfor clothing productsrdquo European Journal of Marketing vol 38 no 7pp 883ndash897 2004

[12] D Hawkins D Mothersbaugh and R Best ldquoConsumer behaviourBuilding marketing strategy mcgraw hillrdquo 2012

[13] G Kulkarni P Kannan and W Moe ldquoUsing online search data toforecast new product salesrdquo Decision Support Systems vol 52 no 3pp 604ndash611 2012

[14] J Wang W X Zhao H Wei H Yan and X Li ldquoMining new businessopportunities Identifying trend related products by leveraging commer-cial intents from microblogsrdquo in Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Processing pp 1337ndash13472013

[15] B Hollerit M Kroll and M Strohmaier ldquoTowards linking buyers andsellers detecting commercial intent on twitterrdquo in Proceedings of the22nd International Conference on World Wide Web pp 629ndash632 ACM2013

[16] F Kooti K Lerman L M Aiello M Grbovic N Djuric and V Ra-dosavljevic ldquoPortrait of an online shopper Understanding and predictingconsumer behaviorrdquo in Proceedings of the 9th ACM InternationalConference on Web Search and Data Mining pp 205ndash214 ACM 2016

[17] J Leskovec L A Adamic and B A Huberman ldquoThe dynamics ofviral marketingrdquo ACM Transactions on The Web (TWEB) vol 1 no 1p 5 2007

[18] A Anagnostopoulos R Kumar and M Mahdian ldquoInfluence and cor-relation in social networksrdquo in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 7ndash15 ACM 2008

[19] D Crandall D Cosley D Huttenlocher J Kleinberg and S SurildquoFeedback effects between similarity and social influence in onlinecommunitiesrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining pp 160ndash168ACM 2008

[20] C R Shalizi and A C Thomas ldquoHomophily and contagion are generi-cally confounded in observational social network studiesrdquo SociologicalMethods amp Research vol 40 no 2 pp 211ndash239 2011

[21] V Gupta D Varshney H Jhamtani D Kedia and S Karwa ldquoIdenti-fying purchase intent from social postsrdquo in ICWSM 2014

[22] Y Zhang and M Pennacchiotti ldquoPredicting purchase behaviors fromsocial mediardquo in Proceedings of the 22nd International Conference onWorld Wide Web pp 1521ndash1532 ACM 2013

[23] X W Zhao Y Guo Y He H Jiang Y Wu and X Li ldquoWe knowwhat you want to buy a demographic-based system for product recom-mendation on microblogsrdquo in Proceedings of the 20th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 1935ndash1944 ACM 2014

[24] R Crane and D Sornette ldquoRobust dynamic classes revealed by mea-suring the response function of a social systemrdquo Proceedings of theNational Academy of Sciences vol 105 no 41 pp 15649ndash15653 2008

[25] J Lehmann B Goncalves J J Ramasco and C Cattuto ldquoDynamicalclasses of collective attention in twitterrdquo in Proceedings of the 21stInternational Conference on World Wide Web pp 251ndash260 ACM 2012

[26] D M Romero B Meeder and J Kleinberg ldquoDifferences in the mechan-ics of information diffusion across topics idioms political hashtags andcomplex contagion on twitterrdquo in Proceedings of the 20th InternationalConference on World Wide Web pp 695ndash704 ACM 2011

[27] J Yang and J Leskovec ldquoPatterns of temporal variation in onlinemediardquo in Proceedings of the 4th ACM International Conference onWeb Search and Data Mining pp 177ndash186 ACM 2011

[28] M G Akritas S A Murphy and M P Lavalley ldquoThe theil-senestimator with doubly censored data and applications to astronomyrdquoJournal of the American Statistical Association vol 90 no 429pp 170ndash177 1995

[29] D W Hosmer Jr S Lemeshow and R X Sturdivant Applied logisticregression vol 398 John Wiley amp Sons 2013

[30] H-F Yu F-L Huang and C-J Lin ldquoDual coordinate descent methodsfor logistic regression and maximum entropy modelsrdquo Machine Learn-ing vol 85 no 1 pp 41ndash75 2011

[31] A Liaw M Wiener et al ldquoClassification and regression by random-forestrdquo R News vol 2 no 3 pp 18ndash22 2002

Page 6: Social Media Brand Engagement as a Proxy for E-commerce ...eprints.nottingham.ac.uk/55994/1/Social Media Brand... · of tools and services for marketeers to manage their Social Media

(a) Precision statistics for 3-q next-day predictions for Orders (b) Precision statistics for 5-q next-day predictions for Orders

Fig 6 Precision of Random Forest predictors for 3-q and 5-q categories for the next-day Orders across 33 vendors The topquantile precisions are higher than random predictions (33 for 3-q and 20 for 5-q)

TABLE IV Precision statistics for Random Forest 5-q classifiers for the next-day volumes of each EPA type OrderClickthrough and Search Results are aggregated per five categories

Vendor 0-20 20-40 40-60 60-80 80-100AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN AVG MAX MIN

Order

PampE 0159 0338 0013 0136 0218 0054 0198 0292 0070 0298 0570 0149 0341 0393 0271Sports 0129 0340 0000 0161 0370 0045 0154 0263 0021 0247 0425 0149 0456 0653 0154Food 0070 0231 0000 0090 0302 0000 0144 0271 0030 0313 0462 0181 0410 0708 0213Clothes 0199 0571 0000 0129 0217 0011 0186 0385 0065 0225 0464 0018 0308 0528 0076Home 0167 0596 0000 0157 0317 0000 0263 0517 0092 0222 0333 0091 0312 0481 0079

Clickthrough

PampE 0371 0548 0192 0200 0311 0099 0147 0241 0025 0159 0353 0000 0265 0565 0055Sports 0462 0631 0275 0264 0393 0094 0144 0328 0038 0135 0348 0024 0148 0286 0037Food 0351 0860 0083 0223 0488 0056 0177 0345 0030 0136 0290 0000 0162 0385 0036Clothes 0312 0571 0013 0165 0260 0030 0199 0426 0012 0145 0308 0000 0192 0500 0023Home 0324 0645 0100 0225 0347 0116 0188 0385 0025 0149 0259 0039 0174 0365 0000

Search

PampE 0521 0679 0336 0231 0353 0065 0109 0293 0029 0077 0176 0000 0255 0481 0070Sports 0483 0763 0276 0248 0350 0103 0130 0222 0000 0081 0173 0000 0126 0250 0066Food 0308 0784 0089 0193 0370 0027 0183 0324 0027 0175 0373 0052 0185 0317 0048Clothes 0329 0628 0194 0251 0353 0114 0232 0313 0131 0127 0238 0020 0119 0288 0043Home 0351 0573 0179 0232 0323 0041 0200 0322 0083 0200 0311 0098 0194 0316 0086

day) and the multi-day cumulative activities over 3-day and7-day periods For each prediction type and each individualvendor we use the corresponding quantile scale determinedover the training data Our experiments thus involve 891predictions for each of 33 vendors 3 EPA types (SearchClickthrough Orders) with three quantile scales (2-q 3-qand 5-q) and 3 time periods (1-day 3-day 7-day) For thisdiscussion we select experiments that shed light on

bull How successful the predictors are in identifying quantilesfor individual EPA types across the vendor sample

bull How well the predictors perform for the cumulative 3-dayand 7-day activities across the vendor sample

bull Which features significantly contribute to the perfor-mance of predictors for the individual EPA type

A Next-day EPA Predictions for 3 and 5 QuantilesExperiments with the next day predictions of EPA for 3-q

and 5-q show that the Random Forest (RF) predictors performhigher than random for Orders in the top quantiles (top 33and 20 respectively) Figure 6 presents RF precision forOrder quantiles for individual vendors Table IV summarizesRF precision statistics for all three EPA types and the fivevendor categories We highlight the RF results that are betterthan random predictions (gt 020 for 5-q) We see thatin addition to the top quantile predictions for Orders theprecision statistics are better than random for Search andClickthrough in the lowest quantile (bottom 33 for 3-q and

bottom 20 for 5-q) All other quantile predictions for OrderClickthrough and Search are on par or lower than random

We conducted the same experiments with Logistic Regres-sion (LR) and found that RF outperforms LR predictors Infact for our sample of product vendors the t-test performedfor LR and RF predictions yield p lt 005 for all the quantilelabels (not just the top quantiles)

B Predictions of Multi-day Cumulative EPA

Considering a selective success of SMA based predictorsfor quantiles of the next-day EPA we train predictors forcumulative EPA ie EPA volumes over 3 and 7 days Weexpect that multi-day cumulative statistics are less volatileand therefore the quantile levels may be more stable andpredictable over time

Our experiments show that RF predictors for 3-day and7-day cumulative EPA are consistently performing at thesimilar order of magnitude for a given quantile level That isillustrated in Table V showing the precision statistics of RFpredictors for 1-day 3-day and 7-day cumulative volumes ofOrder Clickthrough and Search for 3 quantiles (3-q) Similarobservations are made for 5 quantiles and for the LR classifier

We conclude that our high performing predictors for cu-mulative multi-day EPA volumes can be flexibly applied todifferent scenarios that benefit from observing cumulativeEPA across multiple days This includes the precision of topquantile volumes for Orders Figure 7 shows 3-q predictions

TABLE V Three quantile prediction results for five categories of products across 33 vendors based on Random Forestreporting average precision by categories of 1-day (1D) 3-day (3D) and 7-day (7D) cumulative Orders Clickthrough and

Search for each quantile

Vendor 0-33 33-66 66-1001D 3D 7D 1D 3D 7D 1D 3D 7D

Order

PE 0232 0225 0237 0327 0329 0303 0522 0536 0551Sports 0241 0212 0172 0237 0240 0298 0584 0598 0607Food 0141 0139 0133 0288 0230 0251 0663 0656 0675Clothes 0287 0286 0280 0297 0272 0310 0397 0431 0478Home 0261 0266 0244 0375 0383 0351 0453 0467 0464AVG 0226 0222 0209 0310 0293 0302 0528 0540 0556

Clickthrough

PE 0488 0480 0471 0269 0303 0232 0385 0362 0372Sports 0637 0620 0623 0226 0263 0214 0232 0223 0266Food 0464 0471 0476 0282 0285 0286 0263 0266 0285Clothes 0418 0399 0423 0301 0311 0293 0331 0328 0329Home 0475 0487 0484 0327 0307 0289 0285 0278 0256AVG 0488 0486 0490 0288 0295 0271 0291 0286 0293

Search

PE 0597 0621 0655 0182 0194 0150 0277 0259 0289Sports 0621 0613 0616 0266 0248 0263 0166 0157 0152Food 0442 0413 0440 0316 0319 0292 0299 0291 0276Clothes 0455 0479 0476 0375 0370 0355 0225 0205 0172Home 0453 0474 0524 0360 0313 0298 0330 0327 0326AVG 0493 0497 0522 0315 0301 0283 0271 0261 0254

(a) Precision of bottom quantile predictionsfor Orders

(b) Precision of middle quantile predictionsfor Orders

(c) Precision of top quantile prediction forOrders

Fig 7 Precision of Random Forest predictions into three quantiles for 1-day 3-day and 7-day cumulative Orders

of 1-day 3-day and 7-day cumulative Orders for individualvendors

C Feature Significance

Both Random Forest and Logistic Regression allow us toassess the significance of individual feature types in terms oftheir contributions to the prediction decision For illustrationwe present the analysis of features for the RF classifier trainedfor 3-q predictions of the next-day EPA volumes For eachclassifier we calculate the average relative rank of a featureTable VI shows the top 10 contributing features of 1-day EPApredictors for Orders Clickthrough and Search respectivelyacross all the vendors and 3 quantiles

We observe that the relative feature importance is to adegree in agreement with the observations from the corre-lation analysis in Section V-2 Search levels are predictedby Comment and Repost activities considered over differentlengths of time 1 3 5 and 7 days Clickthrough activitiesseem to be aligned with SMA over 7 and 3 day periodsprimarily with consumer comments Orders however areclearly related to Comment volumes over a longer periodsie 5 and 7 days Overall we recognize the importance of

features generated from Comment activities as they contributeto the predictors of all EPA types

TABLE VI Ten top-ranked features of the 3-q RandomForest predictors for the next-day EPA

Rank Order Clickthrough Search1 7DCommentmin 7DCommentmin 7DCommentmin

2 7DCommentsum 3DCommentmean 5DRepostmin

3 7DCommentmean 3DCommentsum PreviousDComment4 7DCommentmax PreviousDComment 7DCommentstd5 5DCommentmean 3DCommentmax 3DCommentmean

6 5DCommentmax 7DCommentsum 7DRepoststd7 5DCommentsum 7DCommentmean 5DRepostsum8 7DCommentstd 7DPostsum 3DCommentsum9 7DCommentvar 3DCommentmin 7DCommentvar10 3DCommentmean 5DCommentvar 5DCommentvar

VI CONCLUDING REMARKS

This paper presents a detailed empirical study of the re-lationship between SMA of product vendors and EPA ofconsumers interested in the vendorsrsquo products The studyis the first to characterize the correlation of specific SMAengagements ie posts reposts and comments and EPA typesthat correspond to specific stages in the consumersrsquo purchasedecisions ie search for brands and products clickthrough

product information and product orders Our analyses un-covered low-to-moderate correlations between the volumes ofSMA-EPA pairs suggesting that predicting daily volumes ofEPA based on SMA volumes alone is not well supported How-ever moderate correlations of Post-Search Post-Clickthroughand Comment-Order across vendors suggest that one may beable to train predictors of EPA distributions and their changesrather that the precise daily volumes We thus introduce a newapproach of characterizing SMA-EPA relationship in terms ofEPA quantiles We formulate EPA predictions as a multi-classcategorization problem into 2 3 and 5 quantiles Our RandomForest classifiers outperform both the random predictors andthe Logistic Regression classifiers for top quantiles of Orderand bottom quantiles of Search and Clickthrough Our studyprovides unique insights into the varied correlations of SMA-EPA pairs and a mixed success of SMA-based predictorsin determining EPA levels A general view that social me-dia engagements undoubtedly drive consumer traction on e-commerce platforms is not substantiated by the Sina Weiboand JD case study and requires further analysis In our futurework we will expand explorations of this issue with a broaderset of product categories and SMA analyses that considerproperties of the content exchanged through the SMA

VII ACKNOWLEDGEMENTS

We wish to express our gratitude to JDcom for providingaccess to invaluable data and Bin Xu and Hao Dong for theirassistance We also acknowledge the financial support from theInternational Doctoral Innovation Centre Ningbo EducationBureau Ningbo Science and Technology Bureau and theUniversity of Nottingham This work was also supported bythe UK Engineering and Physical Sciences Research Council[grant number EPL0154631]

REFERENCES

[1] J Yeo S Kim E Koh S-w Hwang and N Lipka ldquoPredicting onlinepurchase conversion for retargetingrdquo in Proceedings of the 10th ACMInternational Conference on Web Search and Data Mining pp 591ndash600ACM 2017

[2] C Lo D Frankowski and J Leskovec ldquoUnderstanding behaviors thatlead to purchasing A case study of pinterestrdquo in Proceedings of the22th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining pp 531ndash540 2016

[3] W W Moe ldquoBuying searching or browsing Differentiating betweenonline shoppers using in-store navigational clickstreamrdquo Journal ofConsumer Psychology vol 13 no 1-2 pp 29ndash39 2003

[4] C Sismeiro and R E Bucklin ldquoModeling purchase behavior at an e-commerce web site A task-completion approachrdquo Journal of MarketingResearch vol 41 no 3 pp 306ndash323 2004

[5] C Park D Kim J Oh and H Yu ldquoPredicting user purchase in e-commerce by comprehensive feature engineering and decision boundaryfocused under-samplingrdquo in Proceedings of the 2015 International ACMRecommender Systems Challenge p 8 ACM 2015

[6] D Van den Poel and W Buckinx ldquoPredicting online-purchasing be-haviourrdquo European Journal of Operational Research vol 166 no 2pp 557ndash575 2005

[7] D J Bertsimas A J Mersereau and N R Patel ldquoDynamic classifica-tion of online customersrdquo in Proceedings of the 2003 SIAM InternationalConference on Data Mining pp 107ndash118 SIAM 2003

[8] K L Kotler Philip Keller Marketing Management volume 14 PrenticeHall Englewood Cliffs 2015

[9] L Xu J A Duan and A Whinston ldquoPath to purchase A mutuallyexciting point process model for online advertising and conversionrdquoManagement Science vol 60 no 6 pp 1392ndash1412 2014

[10] X Yu Y Liu X Huang and A An ldquoMining online reviews forpredicting sales performance A case study in the movie domainrdquoIEEE Transactions on Knowledge and Data Engineering vol 24 no 4pp 720ndash734 2012

[11] E Young Kim and Y-K Kim ldquoPredicting online purchase intentionsfor clothing productsrdquo European Journal of Marketing vol 38 no 7pp 883ndash897 2004

[12] D Hawkins D Mothersbaugh and R Best ldquoConsumer behaviourBuilding marketing strategy mcgraw hillrdquo 2012

[13] G Kulkarni P Kannan and W Moe ldquoUsing online search data toforecast new product salesrdquo Decision Support Systems vol 52 no 3pp 604ndash611 2012

[14] J Wang W X Zhao H Wei H Yan and X Li ldquoMining new businessopportunities Identifying trend related products by leveraging commer-cial intents from microblogsrdquo in Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Processing pp 1337ndash13472013

[15] B Hollerit M Kroll and M Strohmaier ldquoTowards linking buyers andsellers detecting commercial intent on twitterrdquo in Proceedings of the22nd International Conference on World Wide Web pp 629ndash632 ACM2013

[16] F Kooti K Lerman L M Aiello M Grbovic N Djuric and V Ra-dosavljevic ldquoPortrait of an online shopper Understanding and predictingconsumer behaviorrdquo in Proceedings of the 9th ACM InternationalConference on Web Search and Data Mining pp 205ndash214 ACM 2016

[17] J Leskovec L A Adamic and B A Huberman ldquoThe dynamics ofviral marketingrdquo ACM Transactions on The Web (TWEB) vol 1 no 1p 5 2007

[18] A Anagnostopoulos R Kumar and M Mahdian ldquoInfluence and cor-relation in social networksrdquo in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 7ndash15 ACM 2008

[19] D Crandall D Cosley D Huttenlocher J Kleinberg and S SurildquoFeedback effects between similarity and social influence in onlinecommunitiesrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining pp 160ndash168ACM 2008

[20] C R Shalizi and A C Thomas ldquoHomophily and contagion are generi-cally confounded in observational social network studiesrdquo SociologicalMethods amp Research vol 40 no 2 pp 211ndash239 2011

[21] V Gupta D Varshney H Jhamtani D Kedia and S Karwa ldquoIdenti-fying purchase intent from social postsrdquo in ICWSM 2014

[22] Y Zhang and M Pennacchiotti ldquoPredicting purchase behaviors fromsocial mediardquo in Proceedings of the 22nd International Conference onWorld Wide Web pp 1521ndash1532 ACM 2013

[23] X W Zhao Y Guo Y He H Jiang Y Wu and X Li ldquoWe knowwhat you want to buy a demographic-based system for product recom-mendation on microblogsrdquo in Proceedings of the 20th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 1935ndash1944 ACM 2014

[24] R Crane and D Sornette ldquoRobust dynamic classes revealed by mea-suring the response function of a social systemrdquo Proceedings of theNational Academy of Sciences vol 105 no 41 pp 15649ndash15653 2008

[25] J Lehmann B Goncalves J J Ramasco and C Cattuto ldquoDynamicalclasses of collective attention in twitterrdquo in Proceedings of the 21stInternational Conference on World Wide Web pp 251ndash260 ACM 2012

[26] D M Romero B Meeder and J Kleinberg ldquoDifferences in the mechan-ics of information diffusion across topics idioms political hashtags andcomplex contagion on twitterrdquo in Proceedings of the 20th InternationalConference on World Wide Web pp 695ndash704 ACM 2011

[27] J Yang and J Leskovec ldquoPatterns of temporal variation in onlinemediardquo in Proceedings of the 4th ACM International Conference onWeb Search and Data Mining pp 177ndash186 ACM 2011

[28] M G Akritas S A Murphy and M P Lavalley ldquoThe theil-senestimator with doubly censored data and applications to astronomyrdquoJournal of the American Statistical Association vol 90 no 429pp 170ndash177 1995

[29] D W Hosmer Jr S Lemeshow and R X Sturdivant Applied logisticregression vol 398 John Wiley amp Sons 2013

[30] H-F Yu F-L Huang and C-J Lin ldquoDual coordinate descent methodsfor logistic regression and maximum entropy modelsrdquo Machine Learn-ing vol 85 no 1 pp 41ndash75 2011

[31] A Liaw M Wiener et al ldquoClassification and regression by random-forestrdquo R News vol 2 no 3 pp 18ndash22 2002

Page 7: Social Media Brand Engagement as a Proxy for E-commerce ...eprints.nottingham.ac.uk/55994/1/Social Media Brand... · of tools and services for marketeers to manage their Social Media

TABLE V Three quantile prediction results for five categories of products across 33 vendors based on Random Forestreporting average precision by categories of 1-day (1D) 3-day (3D) and 7-day (7D) cumulative Orders Clickthrough and

Search for each quantile

Vendor 0-33 33-66 66-1001D 3D 7D 1D 3D 7D 1D 3D 7D

Order

PE 0232 0225 0237 0327 0329 0303 0522 0536 0551Sports 0241 0212 0172 0237 0240 0298 0584 0598 0607Food 0141 0139 0133 0288 0230 0251 0663 0656 0675Clothes 0287 0286 0280 0297 0272 0310 0397 0431 0478Home 0261 0266 0244 0375 0383 0351 0453 0467 0464AVG 0226 0222 0209 0310 0293 0302 0528 0540 0556

Clickthrough

PE 0488 0480 0471 0269 0303 0232 0385 0362 0372Sports 0637 0620 0623 0226 0263 0214 0232 0223 0266Food 0464 0471 0476 0282 0285 0286 0263 0266 0285Clothes 0418 0399 0423 0301 0311 0293 0331 0328 0329Home 0475 0487 0484 0327 0307 0289 0285 0278 0256AVG 0488 0486 0490 0288 0295 0271 0291 0286 0293

Search

PE 0597 0621 0655 0182 0194 0150 0277 0259 0289Sports 0621 0613 0616 0266 0248 0263 0166 0157 0152Food 0442 0413 0440 0316 0319 0292 0299 0291 0276Clothes 0455 0479 0476 0375 0370 0355 0225 0205 0172Home 0453 0474 0524 0360 0313 0298 0330 0327 0326AVG 0493 0497 0522 0315 0301 0283 0271 0261 0254

(a) Precision of bottom quantile predictionsfor Orders

(b) Precision of middle quantile predictionsfor Orders

(c) Precision of top quantile prediction forOrders

Fig 7 Precision of Random Forest predictions into three quantiles for 1-day 3-day and 7-day cumulative Orders

of 1-day 3-day and 7-day cumulative Orders for individualvendors

C Feature Significance

Both Random Forest and Logistic Regression allow us toassess the significance of individual feature types in terms oftheir contributions to the prediction decision For illustrationwe present the analysis of features for the RF classifier trainedfor 3-q predictions of the next-day EPA volumes For eachclassifier we calculate the average relative rank of a featureTable VI shows the top 10 contributing features of 1-day EPApredictors for Orders Clickthrough and Search respectivelyacross all the vendors and 3 quantiles

We observe that the relative feature importance is to adegree in agreement with the observations from the corre-lation analysis in Section V-2 Search levels are predictedby Comment and Repost activities considered over differentlengths of time 1 3 5 and 7 days Clickthrough activitiesseem to be aligned with SMA over 7 and 3 day periodsprimarily with consumer comments Orders however areclearly related to Comment volumes over a longer periodsie 5 and 7 days Overall we recognize the importance of

features generated from Comment activities as they contributeto the predictors of all EPA types

TABLE VI Ten top-ranked features of the 3-q RandomForest predictors for the next-day EPA

Rank Order Clickthrough Search1 7DCommentmin 7DCommentmin 7DCommentmin

2 7DCommentsum 3DCommentmean 5DRepostmin

3 7DCommentmean 3DCommentsum PreviousDComment4 7DCommentmax PreviousDComment 7DCommentstd5 5DCommentmean 3DCommentmax 3DCommentmean

6 5DCommentmax 7DCommentsum 7DRepoststd7 5DCommentsum 7DCommentmean 5DRepostsum8 7DCommentstd 7DPostsum 3DCommentsum9 7DCommentvar 3DCommentmin 7DCommentvar10 3DCommentmean 5DCommentvar 5DCommentvar

VI CONCLUDING REMARKS

This paper presents a detailed empirical study of the re-lationship between SMA of product vendors and EPA ofconsumers interested in the vendorsrsquo products The studyis the first to characterize the correlation of specific SMAengagements ie posts reposts and comments and EPA typesthat correspond to specific stages in the consumersrsquo purchasedecisions ie search for brands and products clickthrough

product information and product orders Our analyses un-covered low-to-moderate correlations between the volumes ofSMA-EPA pairs suggesting that predicting daily volumes ofEPA based on SMA volumes alone is not well supported How-ever moderate correlations of Post-Search Post-Clickthroughand Comment-Order across vendors suggest that one may beable to train predictors of EPA distributions and their changesrather that the precise daily volumes We thus introduce a newapproach of characterizing SMA-EPA relationship in terms ofEPA quantiles We formulate EPA predictions as a multi-classcategorization problem into 2 3 and 5 quantiles Our RandomForest classifiers outperform both the random predictors andthe Logistic Regression classifiers for top quantiles of Orderand bottom quantiles of Search and Clickthrough Our studyprovides unique insights into the varied correlations of SMA-EPA pairs and a mixed success of SMA-based predictorsin determining EPA levels A general view that social me-dia engagements undoubtedly drive consumer traction on e-commerce platforms is not substantiated by the Sina Weiboand JD case study and requires further analysis In our futurework we will expand explorations of this issue with a broaderset of product categories and SMA analyses that considerproperties of the content exchanged through the SMA

VII ACKNOWLEDGEMENTS

We wish to express our gratitude to JDcom for providingaccess to invaluable data and Bin Xu and Hao Dong for theirassistance We also acknowledge the financial support from theInternational Doctoral Innovation Centre Ningbo EducationBureau Ningbo Science and Technology Bureau and theUniversity of Nottingham This work was also supported bythe UK Engineering and Physical Sciences Research Council[grant number EPL0154631]

REFERENCES

[1] J Yeo S Kim E Koh S-w Hwang and N Lipka ldquoPredicting onlinepurchase conversion for retargetingrdquo in Proceedings of the 10th ACMInternational Conference on Web Search and Data Mining pp 591ndash600ACM 2017

[2] C Lo D Frankowski and J Leskovec ldquoUnderstanding behaviors thatlead to purchasing A case study of pinterestrdquo in Proceedings of the22th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining pp 531ndash540 2016

[3] W W Moe ldquoBuying searching or browsing Differentiating betweenonline shoppers using in-store navigational clickstreamrdquo Journal ofConsumer Psychology vol 13 no 1-2 pp 29ndash39 2003

[4] C Sismeiro and R E Bucklin ldquoModeling purchase behavior at an e-commerce web site A task-completion approachrdquo Journal of MarketingResearch vol 41 no 3 pp 306ndash323 2004

[5] C Park D Kim J Oh and H Yu ldquoPredicting user purchase in e-commerce by comprehensive feature engineering and decision boundaryfocused under-samplingrdquo in Proceedings of the 2015 International ACMRecommender Systems Challenge p 8 ACM 2015

[6] D Van den Poel and W Buckinx ldquoPredicting online-purchasing be-haviourrdquo European Journal of Operational Research vol 166 no 2pp 557ndash575 2005

[7] D J Bertsimas A J Mersereau and N R Patel ldquoDynamic classifica-tion of online customersrdquo in Proceedings of the 2003 SIAM InternationalConference on Data Mining pp 107ndash118 SIAM 2003

[8] K L Kotler Philip Keller Marketing Management volume 14 PrenticeHall Englewood Cliffs 2015

[9] L Xu J A Duan and A Whinston ldquoPath to purchase A mutuallyexciting point process model for online advertising and conversionrdquoManagement Science vol 60 no 6 pp 1392ndash1412 2014

[10] X Yu Y Liu X Huang and A An ldquoMining online reviews forpredicting sales performance A case study in the movie domainrdquoIEEE Transactions on Knowledge and Data Engineering vol 24 no 4pp 720ndash734 2012

[11] E Young Kim and Y-K Kim ldquoPredicting online purchase intentionsfor clothing productsrdquo European Journal of Marketing vol 38 no 7pp 883ndash897 2004

[12] D Hawkins D Mothersbaugh and R Best ldquoConsumer behaviourBuilding marketing strategy mcgraw hillrdquo 2012

[13] G Kulkarni P Kannan and W Moe ldquoUsing online search data toforecast new product salesrdquo Decision Support Systems vol 52 no 3pp 604ndash611 2012

[14] J Wang W X Zhao H Wei H Yan and X Li ldquoMining new businessopportunities Identifying trend related products by leveraging commer-cial intents from microblogsrdquo in Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Processing pp 1337ndash13472013

[15] B Hollerit M Kroll and M Strohmaier ldquoTowards linking buyers andsellers detecting commercial intent on twitterrdquo in Proceedings of the22nd International Conference on World Wide Web pp 629ndash632 ACM2013

[16] F Kooti K Lerman L M Aiello M Grbovic N Djuric and V Ra-dosavljevic ldquoPortrait of an online shopper Understanding and predictingconsumer behaviorrdquo in Proceedings of the 9th ACM InternationalConference on Web Search and Data Mining pp 205ndash214 ACM 2016

[17] J Leskovec L A Adamic and B A Huberman ldquoThe dynamics ofviral marketingrdquo ACM Transactions on The Web (TWEB) vol 1 no 1p 5 2007

[18] A Anagnostopoulos R Kumar and M Mahdian ldquoInfluence and cor-relation in social networksrdquo in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 7ndash15 ACM 2008

[19] D Crandall D Cosley D Huttenlocher J Kleinberg and S SurildquoFeedback effects between similarity and social influence in onlinecommunitiesrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining pp 160ndash168ACM 2008

[20] C R Shalizi and A C Thomas ldquoHomophily and contagion are generi-cally confounded in observational social network studiesrdquo SociologicalMethods amp Research vol 40 no 2 pp 211ndash239 2011

[21] V Gupta D Varshney H Jhamtani D Kedia and S Karwa ldquoIdenti-fying purchase intent from social postsrdquo in ICWSM 2014

[22] Y Zhang and M Pennacchiotti ldquoPredicting purchase behaviors fromsocial mediardquo in Proceedings of the 22nd International Conference onWorld Wide Web pp 1521ndash1532 ACM 2013

[23] X W Zhao Y Guo Y He H Jiang Y Wu and X Li ldquoWe knowwhat you want to buy a demographic-based system for product recom-mendation on microblogsrdquo in Proceedings of the 20th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 1935ndash1944 ACM 2014

[24] R Crane and D Sornette ldquoRobust dynamic classes revealed by mea-suring the response function of a social systemrdquo Proceedings of theNational Academy of Sciences vol 105 no 41 pp 15649ndash15653 2008

[25] J Lehmann B Goncalves J J Ramasco and C Cattuto ldquoDynamicalclasses of collective attention in twitterrdquo in Proceedings of the 21stInternational Conference on World Wide Web pp 251ndash260 ACM 2012

[26] D M Romero B Meeder and J Kleinberg ldquoDifferences in the mechan-ics of information diffusion across topics idioms political hashtags andcomplex contagion on twitterrdquo in Proceedings of the 20th InternationalConference on World Wide Web pp 695ndash704 ACM 2011

[27] J Yang and J Leskovec ldquoPatterns of temporal variation in onlinemediardquo in Proceedings of the 4th ACM International Conference onWeb Search and Data Mining pp 177ndash186 ACM 2011

[28] M G Akritas S A Murphy and M P Lavalley ldquoThe theil-senestimator with doubly censored data and applications to astronomyrdquoJournal of the American Statistical Association vol 90 no 429pp 170ndash177 1995

[29] D W Hosmer Jr S Lemeshow and R X Sturdivant Applied logisticregression vol 398 John Wiley amp Sons 2013

[30] H-F Yu F-L Huang and C-J Lin ldquoDual coordinate descent methodsfor logistic regression and maximum entropy modelsrdquo Machine Learn-ing vol 85 no 1 pp 41ndash75 2011

[31] A Liaw M Wiener et al ldquoClassification and regression by random-forestrdquo R News vol 2 no 3 pp 18ndash22 2002

Page 8: Social Media Brand Engagement as a Proxy for E-commerce ...eprints.nottingham.ac.uk/55994/1/Social Media Brand... · of tools and services for marketeers to manage their Social Media

product information and product orders Our analyses un-covered low-to-moderate correlations between the volumes ofSMA-EPA pairs suggesting that predicting daily volumes ofEPA based on SMA volumes alone is not well supported How-ever moderate correlations of Post-Search Post-Clickthroughand Comment-Order across vendors suggest that one may beable to train predictors of EPA distributions and their changesrather that the precise daily volumes We thus introduce a newapproach of characterizing SMA-EPA relationship in terms ofEPA quantiles We formulate EPA predictions as a multi-classcategorization problem into 2 3 and 5 quantiles Our RandomForest classifiers outperform both the random predictors andthe Logistic Regression classifiers for top quantiles of Orderand bottom quantiles of Search and Clickthrough Our studyprovides unique insights into the varied correlations of SMA-EPA pairs and a mixed success of SMA-based predictorsin determining EPA levels A general view that social me-dia engagements undoubtedly drive consumer traction on e-commerce platforms is not substantiated by the Sina Weiboand JD case study and requires further analysis In our futurework we will expand explorations of this issue with a broaderset of product categories and SMA analyses that considerproperties of the content exchanged through the SMA

VII ACKNOWLEDGEMENTS

We wish to express our gratitude to JDcom for providingaccess to invaluable data and Bin Xu and Hao Dong for theirassistance We also acknowledge the financial support from theInternational Doctoral Innovation Centre Ningbo EducationBureau Ningbo Science and Technology Bureau and theUniversity of Nottingham This work was also supported bythe UK Engineering and Physical Sciences Research Council[grant number EPL0154631]

REFERENCES

[1] J Yeo S Kim E Koh S-w Hwang and N Lipka ldquoPredicting onlinepurchase conversion for retargetingrdquo in Proceedings of the 10th ACMInternational Conference on Web Search and Data Mining pp 591ndash600ACM 2017

[2] C Lo D Frankowski and J Leskovec ldquoUnderstanding behaviors thatlead to purchasing A case study of pinterestrdquo in Proceedings of the22th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining pp 531ndash540 2016

[3] W W Moe ldquoBuying searching or browsing Differentiating betweenonline shoppers using in-store navigational clickstreamrdquo Journal ofConsumer Psychology vol 13 no 1-2 pp 29ndash39 2003

[4] C Sismeiro and R E Bucklin ldquoModeling purchase behavior at an e-commerce web site A task-completion approachrdquo Journal of MarketingResearch vol 41 no 3 pp 306ndash323 2004

[5] C Park D Kim J Oh and H Yu ldquoPredicting user purchase in e-commerce by comprehensive feature engineering and decision boundaryfocused under-samplingrdquo in Proceedings of the 2015 International ACMRecommender Systems Challenge p 8 ACM 2015

[6] D Van den Poel and W Buckinx ldquoPredicting online-purchasing be-haviourrdquo European Journal of Operational Research vol 166 no 2pp 557ndash575 2005

[7] D J Bertsimas A J Mersereau and N R Patel ldquoDynamic classifica-tion of online customersrdquo in Proceedings of the 2003 SIAM InternationalConference on Data Mining pp 107ndash118 SIAM 2003

[8] K L Kotler Philip Keller Marketing Management volume 14 PrenticeHall Englewood Cliffs 2015

[9] L Xu J A Duan and A Whinston ldquoPath to purchase A mutuallyexciting point process model for online advertising and conversionrdquoManagement Science vol 60 no 6 pp 1392ndash1412 2014

[10] X Yu Y Liu X Huang and A An ldquoMining online reviews forpredicting sales performance A case study in the movie domainrdquoIEEE Transactions on Knowledge and Data Engineering vol 24 no 4pp 720ndash734 2012

[11] E Young Kim and Y-K Kim ldquoPredicting online purchase intentionsfor clothing productsrdquo European Journal of Marketing vol 38 no 7pp 883ndash897 2004

[12] D Hawkins D Mothersbaugh and R Best ldquoConsumer behaviourBuilding marketing strategy mcgraw hillrdquo 2012

[13] G Kulkarni P Kannan and W Moe ldquoUsing online search data toforecast new product salesrdquo Decision Support Systems vol 52 no 3pp 604ndash611 2012

[14] J Wang W X Zhao H Wei H Yan and X Li ldquoMining new businessopportunities Identifying trend related products by leveraging commer-cial intents from microblogsrdquo in Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Processing pp 1337ndash13472013

[15] B Hollerit M Kroll and M Strohmaier ldquoTowards linking buyers andsellers detecting commercial intent on twitterrdquo in Proceedings of the22nd International Conference on World Wide Web pp 629ndash632 ACM2013

[16] F Kooti K Lerman L M Aiello M Grbovic N Djuric and V Ra-dosavljevic ldquoPortrait of an online shopper Understanding and predictingconsumer behaviorrdquo in Proceedings of the 9th ACM InternationalConference on Web Search and Data Mining pp 205ndash214 ACM 2016

[17] J Leskovec L A Adamic and B A Huberman ldquoThe dynamics ofviral marketingrdquo ACM Transactions on The Web (TWEB) vol 1 no 1p 5 2007

[18] A Anagnostopoulos R Kumar and M Mahdian ldquoInfluence and cor-relation in social networksrdquo in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 7ndash15 ACM 2008

[19] D Crandall D Cosley D Huttenlocher J Kleinberg and S SurildquoFeedback effects between similarity and social influence in onlinecommunitiesrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining pp 160ndash168ACM 2008

[20] C R Shalizi and A C Thomas ldquoHomophily and contagion are generi-cally confounded in observational social network studiesrdquo SociologicalMethods amp Research vol 40 no 2 pp 211ndash239 2011

[21] V Gupta D Varshney H Jhamtani D Kedia and S Karwa ldquoIdenti-fying purchase intent from social postsrdquo in ICWSM 2014

[22] Y Zhang and M Pennacchiotti ldquoPredicting purchase behaviors fromsocial mediardquo in Proceedings of the 22nd International Conference onWorld Wide Web pp 1521ndash1532 ACM 2013

[23] X W Zhao Y Guo Y He H Jiang Y Wu and X Li ldquoWe knowwhat you want to buy a demographic-based system for product recom-mendation on microblogsrdquo in Proceedings of the 20th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Miningpp 1935ndash1944 ACM 2014

[24] R Crane and D Sornette ldquoRobust dynamic classes revealed by mea-suring the response function of a social systemrdquo Proceedings of theNational Academy of Sciences vol 105 no 41 pp 15649ndash15653 2008

[25] J Lehmann B Goncalves J J Ramasco and C Cattuto ldquoDynamicalclasses of collective attention in twitterrdquo in Proceedings of the 21stInternational Conference on World Wide Web pp 251ndash260 ACM 2012

[26] D M Romero B Meeder and J Kleinberg ldquoDifferences in the mechan-ics of information diffusion across topics idioms political hashtags andcomplex contagion on twitterrdquo in Proceedings of the 20th InternationalConference on World Wide Web pp 695ndash704 ACM 2011

[27] J Yang and J Leskovec ldquoPatterns of temporal variation in onlinemediardquo in Proceedings of the 4th ACM International Conference onWeb Search and Data Mining pp 177ndash186 ACM 2011

[28] M G Akritas S A Murphy and M P Lavalley ldquoThe theil-senestimator with doubly censored data and applications to astronomyrdquoJournal of the American Statistical Association vol 90 no 429pp 170ndash177 1995

[29] D W Hosmer Jr S Lemeshow and R X Sturdivant Applied logisticregression vol 398 John Wiley amp Sons 2013

[30] H-F Yu F-L Huang and C-J Lin ldquoDual coordinate descent methodsfor logistic regression and maximum entropy modelsrdquo Machine Learn-ing vol 85 no 1 pp 41ndash75 2011

[31] A Liaw M Wiener et al ldquoClassification and regression by random-forestrdquo R News vol 2 no 3 pp 18ndash22 2002