

Proceedings of the 12th INFORMS Workshop on Data Mining and Decision Analytics (Houston, 2017). C. Iyigun, R. Moghaddass, M. Samoran, eds.

The Chinese Voting Process: The Evolution of Online Helpfulness Evaluations in Product Reviews and Question Answers

Moontae Lee, Computer Science, Cornell University, [email protected]

Seok Hyun Jin, Amazon, [email protected]

David Mimno, Information Science, Cornell University, [email protected]

Abstract.

Helpfulness voting is a popular method for highlighting the usefulness of user-contributed responses such as reviews of products and answers to questions. However, such voting is a social process that can gain momentum, biasing toward responses that are already popular. The positional accessibility of individual responses and the presented polarity of existing votes can all affect subsequent votes. Less helpful responses that receive an early boost could block users from accessing more useful future information. Moreover, the momentum may differ between forums, so solutions must be tailored to the micro-cultures of different communities. We propose the Chinese Voting Process (CVP), which models the evolution of online helpfulness votes as a self-reinforcing random process based on position and presentation biases. We evaluate this model on Amazon product reviews and more than 80 major StackExchange forums, inferring the intrinsic quality of individual responses and measuring Trendiness and Conformity of different communities.

Keywords: product reviews, question answers, helpfulness voting, online evaluation, biases.

1 Introduction

As online social platforms grow, user-generated content has become important information for decision making and knowledge discovery. Reading online product reviews strongly influences the buying decision process of consumers [1]. While editorial reviews provide expert information [2], customer reviews on e-commerce sites like Amazon are often more helpful [3] and positively impact product sales [4]. Question answers in Q&A forums are vital for coders and researchers [5]. StackExchange is a popular collaborative platform that consists of more than a hundred online Q&A forums including its flagship: StackOverflow. According to the report by Ericson,¹ overall 3.6 million new questions and more than 4.5 million new answers were generated in 2016 alone, a substantial increase from 2.5 million new questions and 3.2 million new answers posted in 2015.

¹ Stack Exchange Year in Review 2015 and 2016 by Jon Ericson. (https://stackoverflow.blog/)

But volume does not mean quality. The diversity and abundance of user content demand better access to more useful information [6]. Helpfulness voting is a powerful means to evaluate the quality of user responses (i.e., reviews/answers) by the wisdom of crowds [7]. Having quality reviews measured by helpfulness votes improves product sales [4], as does having positive reviews [8]. While these votes are generally valuable in aggregate, there is strong evidence that helpfulness voting is socially influenced. Helpfulness ratings on Amazon product reviews differ significantly from those estimated by independent human annotators [9]. Votes are generally more positive, and the number of votes rapidly decreases with respect to the displayed position of responses [10]. Review polarity is biased toward matching the consensus opinion [11].² Thus estimating the true quality of the responses is difficult, motivating us to propose a new model that is capable of learning the intrinsic quality of responses by considering their social contexts and momentum.

Previous work in online evaluations shows that although inherent quality is an important factor in overall ranking, users are susceptible to position bias [12, 13]. Displaying items in an order affects users: top-ranked items gain more popularity, while low-ranked items remain in obscurity. In this work we measure sensitivity to order and find that it differs across communities: some value a range of opinions, while others prefer a single authoritative answer. Summary information displayed together can lead to presentation bias [14, 15]. As the current voting scores are visibly presented with responses, users inevitably perceive the score before reading the contents of responses. Such exposure could immediately nudge user evaluations toward the majority opinion, making high-scored responses more attractive. We also find that the relative length of each response affects the polarity of future votes, similar to the effect of cognitive heuristics in [16].

Table 1: Quality interpretation for each sequence of six votes.

Res | Votes  | Diff | Ratio | Relative Quality
1   | +++−−− | 0    | 0.5   | quite negative
2   | +−+−+− | 0    | 0.5   | moderately negative
3   | −+−+−+ | 0    | 0.5   | moderately positive
4   | −−−+++ | 0    | 0.5   | quite positive

Standard discrete models for self-reinforcing processes include the Chinese Restaurant Process and the Polya urn model [17]. Since these models are exchangeable, the order of events does not affect the probability of a sequence. However, Table 1 suggests how different contexts of votes cause different impacts. While the four sequences have equal numbers of positive and negative votes in aggregate, the fourth votes in the

² When two reviews contain essentially the same text but differ only in their star rating, the review closer to the consensus star rating is considered more helpful.


[Figure 1 appears here: a 2D scatter plot of communities with Trendiness (sensitivity to popular responses) on the horizontal axis (0.2–1.0) and Conformity (positivity toward majority votes) on the vertical axis (2.0–4.5). Points are labeled with forum names (e.g., english, matheducators, codereview, amazon, math, stackoverflow, judaism, parenting) and grouped into eleven clusters: theory, leisure, language, humanities, meta, hobby/sideline, content management, power users, management, coding, and admin.]
Figure 1: 2D Community embedding. Each of 83 communities is positioned according to two behavioral coefficients (Trendiness, Conformity). Eleven clusters are grouped based on their common focus. The MEAN community is synthesized by sampling 20 questions from every community (except Amazon, due to its distinct user interface).

first and last responses are given against a clear majority opinion. Our model treats objection to the majority as a more challenging decision that entails higher weight. In contrast, the middle two sequences receive alternating votes, so that there is never a clear majority opinion. As each such vote is a relatively weaker disagreement, the underlying quality is moderate compared to the other two responses. Furthermore, if these are responses to one item, the order between them also matters. If the initial three votes on the fourth response pushed its display position to the next page, for example, it might never have a chance to receive the future votes that could recover its reputation.

The Chinese Voting Process (CVP) models generation of responses and votes, formalizing the evolution of helpfulness under positional and presentational reinforcement. Whereas most previous work on helpfulness prediction [18, 19, 9, 11, 20, 21, 22] has involved a single snapshot (only the moment when the data is collected), the CVP estimates intrinsic quality of responses based on multiple snapshots over time. Note also that these studies have attempted to predict helpfulness given content-based features such as review text, star ratings, sales, and badges. In contrast, we leverage only voting sequences and their social contexts, building a scalable model into which one can easily incorporate any content-based feature. We combine data on Amazon helpfulness votes from [23] with a much larger collection of helpfulness votes from 82 major StackExchange forums, providing an integrated analysis of our model for various user-generated content and its evaluations. The resulting model shows significant improvements in predicting the next helpfulness votes, especially in the critical early stages of a voting trajectory. We also verify that the CVP-estimated intrinsic quality correlates better with the sentiment of user comments associated with each response than simple vote-based rankings do. Finally, we compare different characteristics of self-reinforcing behavior across communities using two behavioral coefficients: Trendiness and Conformity. Given the learned parameters for each community, Trendiness measures the positional sensitivity to salient responses, whereas Conformity captures the propensity to obey the presented valence. Figure 1 illustrates different voting dynamics, from Judaism to JavaScript (in StackOverflow), characterized by Trendiness and Conformity. Such contrast across communities can inspire further research into distinctive treatments that alleviate different degrees of voting biases.

2 The Chinese Voting Process

Our goal is to model helpfulness voting as a two-phase self-reinforcing stochastic process. In the selection phase, each user either selects an existing response based on its displayed position or writes a new response. The positional reinforcement is inspired by the Chinese Restaurant Process (CRP) and the Distance Dependent Chinese Restaurant Process (ddCRP). In the voting phase, when one response is selected, the user chooses one of two feedback options: a positive or a negative vote, based on the intrinsic quality and the presentational factors. The presentational reinforcement is modeled by a log-linear model with time-varying features based on the Polya urn model. The CVP implements rich-get-richer dynamics as an interplay of these two preferential reinforcements, learning latent qualities of individual responses as illustrated in Table 1. Specifically, each user at time t interested in item i follows the generative story in Table 2.

Table 2: The generative story and the parametrization of the Chinese Voting Process (CVP).

Generative process:

1. Evaluate the j-th response: p(z_i^{(t)} = j | z_i^{(1:t−1)}; α) ∝ f_i^{(t−1)}(j)
   (a) 'Yes': p(v_i^{(t)} = 1 | θ) = logit^{−1}(q_{ij} + g_i^{(t−1)}(j))
   (b) 'No': p(v_i^{(t)} = 0 | θ) = 1 − p(v_i^{(t)} = 1 | θ)
2. Or write a new response: p(z_i^{(t)} = J_i + 1 | z_i^{(1:t−1)}; α) ∝ α
   (a) Sample q_{i(J+1)} from N(0, σ²).

Full parametrization (e.g., Amazon):

f_i^{(t)}(j) = (1 / (1 + display-rank_i^{(t)}(j)))^τ
g_i^{(t)}(j) = λ r_{ij}^{(t)} + μ s_{ij}^{(t)} + ν_i u_{ij}^{(t)}
θ = {{q_{ij}}, λ, μ, {ν_i}}
J_i = J_i^{(t−1)} (abbreviated notation)

2.1 Selection phase

The CRP [17, 24] is a self-reinforcing decision process over an infinite discrete set. For each item (product/question) i, the first user writes a new response (review/answer). The t-th subsequent user can choose an existing response j out of J_i^{(t−1)} possible responses with probability proportional to the number of votes n_j^{(t−1)} given to response j by time t − 1, whereas the probability of writing a new response J_i^{(t−1)} + 1 is proportional to a constant α. While the CRP models self-reinforcement (each vote for a response makes that response more likely to be selected later), there is evidence that the actual selection rate in an ordered list decays with display rank [10]. Since such rankings are mechanism-specific and not always clearly known in advance, we need a more flexible model that can specify various degrees of positional preference. The ddCRP [25] introduces a function f that decays with respect to some distance measure. In our formulation, the distance function varies over time and is further configurable with respect to the specific user interface of service providers.

Table 3: Change of quality estimation q_j over time for the first two example responses in Table 1 with the initial pseudo-votes (x, y, w) = (1, 1, 1). The estimated quality of the first response sharply decreases from the first majority-against vote at t = 4, and it ends up more negative than that of the second response, even though they receive identical votes in aggregate. These non-exchangeable behaviors cannot be modeled with a simple exchangeable process.

t or T | v_1 | r_1 | s_1 | Δ_1   | q_1^T  | v_2 | r_2 | s_2 | Δ_2   | q_2^T
0      |     | 1/2 | 1/2 |       |        |     | 1/2 | 1/2 |       |
1      |  +  | 2/3 | 1/3 | 0.167 |   −    |  +  | 2/3 | 1/3 | 0.167 |   −
2      |  +  | 3/4 | 1/4 | 0.083 |  0.363 |  −  | 2/4 | 2/4 | 0.167 | −0.363
3      |  +  | 4/5 | 1/5 | 0.050 |  0.574 |  +  | 3/5 | 2/5 | 0.100 |  0.004
4      |  −  | 4/6 | 2/6 | 0.133 |  0.237 |  −  | 3/6 | 3/6 | 0.100 | −0.230
5      |  −  | 4/7 | 3/7 | 0.095 |  0.004 |  +  | 4/7 | 3/7 | 0.071 |  0.007
6      |  −  | 4/8 | 4/8 | 0.071 | −0.175 |  −  | 4/8 | 4/8 | 0.071 | −0.166

Specifically, the function f_i^{(t)}(j) in the CVP evaluates the popularity of the j-th response in item i at time t. Since we assume that the popularity of responses is decided by their positional accessibility, we parametrize f to be inversely proportional to their display ranks. The exponent τ determines sensitivity to popularity in the selection phase by controlling the degree of harmonic penalization over ranks. Larger τ > 0 indicates that users are more sensitive to trendy responses displayed near the top. If τ < 0, users often select low-ranked responses over high-ranked ones for various reasons.³

Note that even if the user at time t does not vote on the j-th response, f_i^{(t)}(j) could differ from f_i^{(t−1)}(j) in the CVP,⁴ whereas n_{ij}^{(t)} = n_{ij}^{(t−1)} in the CRP. Thus one can view the selection phase of the CVP as a non-exchangeable extension of the CRP via a time-varying function f.
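As a concrete illustration, the selection probabilities above can be sketched in a few lines (a minimal sketch with our own helper name; the 1-based display ranks and the specific α, τ values are assumptions for illustration, not from the paper's code):

```python
def selection_probs(display_ranks, alpha, tau):
    """CVP selection phase for one item at one time step.

    display_ranks: 1-based display rank of each existing response.
    Returns (probabilities of selecting each existing response,
    probability of writing a new response), with popularity
    f(j) = (1 / (1 + rank_j))**tau and p(new) proportional to alpha.
    """
    f = [(1.0 / (1.0 + r)) ** tau for r in display_ranks]
    z = alpha + sum(f)
    return [fj / z for fj in f], alpha / z

# Two responses at ranks 1 and 2 with alpha = 0.5, tau = 1:
# f = [1/2, 1/3], so p(rank 1) = 0.5 / (0.5 + 0.5 + 1/3) = 0.375.
probs, p_new = selection_probs([1, 2], alpha=0.5, tau=1.0)

# Larger tau sharpens the preference for top-ranked responses.
steep, _ = selection_probs([1, 2], alpha=0.5, tau=3.0)
```

Raising τ increases the odds ratio between the top response and lower ones ((3/2)^τ here), which is exactly the harmonic penalization the exponent controls.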

2.2 Voting phase

We next construct a self-reinforcing process for the voting phase. The Polya urn model is a self-reinforcing decision process over a finite discrete set, but because it is exchangeable, it is unable to capture the contextual information encoded in a sequence of votes. We instead use a log-linear formulation with urn-based features, allowing other presentational or content-based features to be flexibly incorporated based on the modeler's observations. Each response initially has x = x^{(0)} positive and y = y^{(0)} negative votes, which could be fractional pseudo-votes. For each draw of a vote, we return w + 1 votes with the same polarity, thus self-reinforcing when w > 0. Table 3 shows the time-evolving positive/negative ratios r_j^{(t)} = x_j^{(t)} / (x_j^{(t)} + y_j^{(t)}) and s_j^{(t)} = y_j^{(t)} / (x_j^{(t)} + y_j^{(t)}) of the first two responses j ∈ {1, 2} in Table 1, with the corresponding ratio gain ∆_j^{(t)} = r_j^{(t)} − r_j^{(t−1)} (if v_j^{(t)} = 1, i.e., +) or s_j^{(t)} − s_j^{(t−1)} (if v_j^{(t)} = 0, i.e., −). In this toy setting, the polarity of a vote on a response is an outcome of its intrinsic quality as well as presentational factors: positive and negative votes. Thus we model each sequence of votes by ℓ₂-regularized logistic regression with the latent intrinsic quality and the Polya urn ratios.⁵

³ Responses near the bottom are sometimes selected, especially in the early stage when only a few responses exist and they are all easily accessible without the burden of scrolling.

⁴ Say the rank of another response j′ was lower than j's at time t − 1. If the t-th vote, given to response j′, raises its rank above that of response j, then f_i^{(t)}(j) < f_i^{(t−1)}(j), assuming τ > 0.

max_θ  log ∏_{t=2}^{T} logit^{−1}(q_j^T + λ r_j^{(t−1)} + μ s_j^{(t−1)}) − (1/2) ‖θ‖₂²   where θ = (q_j^T, λ, μ)    (1)

The {q_j^T} column in Table 3 shows the learned intrinsic quality from solving (1) up to the T-th vote for each j ∈ {1, 2}. The initial vote given at t = 1 is disregarded in the training due to its arbitrariness under the uniform prior (x^{(0)} = y^{(0)}). Since it is quite possible to have only positive or only negative votes, Gaussian regularization is necessary to prevent divergence of the parameters. Note that using the urn-based ratio features is essential to encode contextual information. If we instead use raw count features (only the numerators of r_j and s_j), for example in the first response, the estimated quality q_1^T keeps increasing even after the negative votes from time 4 to 6. Log raw count features are likewise unable to infer the negative quality.

In the first response, ∆_1^{(t)} shows the decreasing gain in positive ratios from t = 1 to 3 and in negative ratios from t = 4 to 6, whereas it gains a relatively large momentum at the first negative vote at t = 4. ∆_2^{(t)} converges to 0 in the second response, implying that future votes have less effect than earlier votes for alternating +/− votes. The estimated quality q_2^T also converges to 0, as we expect neutral quality in the limit. Overall the model is capable of learning intrinsic quality as desired in Table 1, where relative gains can be further controlled by tuning the initial pseudo-votes (x, y).
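The toy estimation in (1) can be sketched with plain gradient ascent (our own minimal implementation, not the authors' code; the objective follows (1) with the Polya-urn features of Table 3, and the first vote is disregarded as in the text):

```python
import math

def fit_quality(votes, x=1.0, y=1.0, w=1.0, lr=0.1, steps=5000):
    """Estimate intrinsic quality q of one response from its vote sequence.

    Maximizes sum_t log p(v_t) - 0.5 * (q**2 + lam**2 + mu**2), where
    p(v_t = 1) = logit^{-1}(q + lam * r_{t-1} + mu * s_{t-1}) and (r, s)
    are the urn's positive/negative ratios just before each vote.
    """
    feats, targets = [], []
    for t, v in enumerate(votes):
        if t >= 1:  # skip the arbitrary first vote (uniform prior x = y)
            feats.append((x / (x + y), y / (x + y)))
            targets.append(1 if v > 0 else 0)
        if v > 0:
            x += w
        else:
            y += w
    q = lam = mu = 0.0
    for _ in range(steps):  # concave objective: gradient ascent converges
        gq, glam, gmu = -q, -lam, -mu  # gradient of the L2 penalty
        for (r, s), v in zip(feats, targets):
            p = 1.0 / (1.0 + math.exp(-(q + lam * r + mu * s)))
            gq += v - p
            glam += (v - p) * r
            gmu += (v - p) * s
        q += lr * gq
        lam += lr * glam
        mu += lr * gmu
    return q

q_pos = fit_quality([+1] * 6)  # uniformly positive sequence
q_neg = fit_quality([-1] * 6)  # uniformly negative sequence
```

By the symmetry of the two sequences the estimates are mirror images, and the uniformly positive sequence receives the higher intrinsic quality.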

In the real setting, the polarity score function g_i^{(t)}(j) in the CVP evaluates presentational factors of the j-th response in item i at time t. Because we adopt a log-linear formulation, one can easily incorporate additional heuristics about voting decisions. In addition to the positive ratio r_{ij}^{(t)} and the negative ratio s_{ij}^{(t)}, g also contains a length feature u_{ij}^{(t)} (as given in Table 2), which is the

⁵ One might think (1) can be equivalently achieved with only two parameters because r_j^{(t)} + s_j^{(t)} = 1 for all t. However, such reparametrization adds inconsistent translations to q_j^T and makes it difficult to interpret the different inclinations between positive and negative votes across various communities.


Table 4: Change of quality estimation q_j over time for the one-side parametrization. The same example responses from Table 1 are used with the initial pseudo-votes (x, w) = (1, 1). The quality estimation is analogous to Table 3.

t or T | v_1 | r_1 | Δ_1    | q_1^T  | v_2 | r_2 | Δ_2    | q_2^T
0      |     | 1/2 |        |        |     | 1/2 |        |
1      |  +  | 2/3 |  0.167 |   −    |  +  | 2/3 |  0.167 |   −
2      |  +  | 3/4 |  0.083 |  0.370 |  −  | 1/2 | −0.167 | −0.370
3      |  +  | 4/5 |  0.050 |  0.586 |  +  | 2/3 |  0.167 |  0.015
4      |  −  | 3/4 | −0.050 |  0.248 |  −  | 1/2 | −0.167 | −0.223
5      |  −  | 2/3 | −0.083 |  0.019 |  +  | 2/3 |  0.167 |  0.041
6      |  −  | 1/2 | −0.167 | −0.161 |  −  | 1/2 | −0.167 | −0.131

relative length of the response j against the average length of responses in item i at that particular time t. Users in some items may prefer shorter responses for brevity, whereas users in other items may blindly believe that longer responses are more credible before reading their contents. The parameter ν_i explains this length-wise preferential idiosyncrasy as a per-item bias: ν_i < 0 means a preference toward shorter responses. Note that g_i^{(t)}(j) could differ from g_i^{(t−1)}(j) even if the user at time t does not choose to vote.⁶ Altogether, the voting phase of the CVP generates non-exchangeable votes.

Single-side parametrization: Whereas online communities like Amazon product reviews visibly present the total numbers of both positive and negative votes,⁷ users in StackExchange forums can only observe the current vote difference.⁸ As positive votes are mostly dominant across forums, users are hardly aware of the existence of negative votes unless responses receive more negative votes than positive ones. For the voting phase of communities whose user interface is similar to the StackExchange platform, therefore, one signed (velocity-like) feature r replaces the two unsigned (speed-like) features (r, s). The corresponding result is described in Table 4.

3 Inference

Each phase of the CVP depends on the result of all previous stages, so decoupling these related problems is crucial for efficient inference. Given the dataset⁹ and hyperparameters (α, σ²), we need to estimate community-level parameters (τ, λ, μ), item-level length preferences (ν_i), and response-level intrinsic qualities (q_{ij}). We further compute two community-level behavioral coefficients, Trendiness (τ) and Conformity (κ), which are useful statistics for comparing voting patterns across different communities.

⁶ If a new response is written at time t, u_{ij}^{(t)} ≠ u_{ij}^{(t−1)}, as the new response changes the average length.

⁷ The user interface has recently changed to show only the number of positive votes. When the data was collected, however, users cast their votes on each review presented with "x of x + y people found the following review helpful."

⁸ The numbers between the up/down voting arrows in Table 2 indicate upvotes minus downvotes.

⁹ Dataset and statistics are available at https://archive.org/details/stackexchange.


3.1 Parameter inference

The goal is to infer the parameters θ = {{q_{ij}}, λ, μ, {ν_i}}. We sometimes use f and g instead to compactly indicate the parameters associated with each function. The likelihood of one CVP step in item i at time t is

L_i^{(t)}(τ, θ; α, σ) = { α / (α + ∑_{j=1}^{J_i^{(t−1)}} f_i^{(t−1)}(j)) · N(q_{i, z_i^{(t)}}; 0, σ²) }^{1[z_i^{(t)} = J_i^{(t−1)} + 1]} × { f_i^{(t−1)}(z_i^{(t)}) / (α + ∑_{j=1}^{J_i^{(t−1)}} f_i^{(t−1)}(j)) · p(v_i^{(t)} | q_{i, z_i^{(t)}}, g_i^{(t−1)}(j)) }^{1[z_i^{(t)} ≤ J_i^{(t−1)}]}.

The fractions in the two terms indicate the probabilities of writing a new response and of choosing an existing response in the selection phase. The other two probability expressions respectively describe quality sampling from a normal distribution and the logistic regression in the voting phase.

It is important to note that our setting differs from many CRP-based models. The CRP is typically used to represent a non-parametric prior over the choice of latent cluster assignments that must themselves be inferred from noisy observations. In our case, the result of each choice is directly observable because we have the complete trajectory of helpfulness votes. As a result, we only need to infer the continuous parameters of the process, and not combinatorial configurations of discrete variables. Since we know the complete trajectory, where the rank inside the function f is part of the true observations, we can view each vote as an independent sample. Denoting the last timestamp of item i by T_i, the log-likelihood becomes ℓ(τ, θ; α, σ) = ∑_{i=1}^{m} ∑_{t=1}^{T_i} log L_i^{(t)} and is further separated into two pieces:

ℓ_v(θ; σ) = ∑_{i=1}^{m} ∑_{t=1}^{T_i} { 1[write] · log N(q_{i, z_i^{(t)}}; 0, σ²) + 1[choose] · log p(v_i^{(t)} | q_{i, z_i^{(t)}}, g_i^{(t−1)}(j)) }    (2)

ℓ_s(τ; α) = ∑_{i=1}^{m} ∑_{t=1}^{T_i} { 1[write] · log [α / (α + ∑_{j=1}^{J_i^{(t−1)}} f_i^{(t−1)}(j))] + 1[choose] · log [f_i^{(t−1)}(z_i^{(t)}) / (α + ∑_{j=1}^{J_i^{(t−1)}} f_i^{(t−1)}(j))] }.

Inferring a whole trajectory based only on the final snapshots would likely be intractable for a non-exchangeable model. Due to the continuous interaction between f and g at every time step, small mis-predictions in the earlier stages would cause entirely different configurations. Moreover, the rank function inside f is in many cases site-specific.¹⁰ By observing all trajectories of the random variables {z_i^{(t)}, v_i^{(t)}}, we can decouple f and g, reducing the inference problem into separate estimations of the selection and voting phases. Maximizing ℓ_v can be efficiently solved by ℓ₂-regularized logistic regression, as demonstrated for (1). If the hyperparameter α is fixed, maximizing ℓ_s becomes a convex optimization because τ appears in both the numerator and the denominator.

¹⁰ We generally know that Amazon decides display order by the proportion of positive votes and the total number of votes on each response, but the relative weights between them are not known. We do not know how StackExchange forums break ties, which matters greatly in the early stages of voting.


3.2 Behavioral coefficients

To succinctly measure overall voting behaviors across different communities, we propose two community-level coefficients. Trendiness indicates the sensitivity to positional popularity in the selection phase. While the community-level τ parameter serves as Trendiness simply to avoid overly complicated models, one can easily extend the CVP to have a per-item τ_i to better fit the data; in that case, Trendiness would be a summary statistic for {τ_i}. Conformity captures users' receptiveness to the prevailing polarity in the voting phase. To count every single vote, we define Conformity as a geometric mean of odds ratios between majority-following votes and majority-disagreeing votes. Let V_i be the set of time steps at which users vote rather than write responses in item i, and let n be the total number of votes across all items in the target community. Then Conformity is defined as

κ = { ∏_{i=1}^{m} ∏_{t∈V_i} ( P(v_i^{(t+1)} = 1 | q_{i, z_i^{(t+1)}}^{t}, λ^{t}, μ^{t}, ν_i^{t}) / P(v_i^{(t+1)} = 0 | q_{i, z_i^{(t+1)}}^{t}, λ^{t}, μ^{t}, ν_i^{t}) )^{h_i^{(t)}} }^{1/n},

where h_i^{(t)} = 1 if n_{ij}^{+(t)} ≥ n_{ij}^{−(t)}, and h_i^{(t)} = −1 if n_{ij}^{+(t)} < n_{ij}^{−(t)}.

To compute Conformity κ, we need to learn θ^t = {q_{ij}^t, λ^t, μ^t, ν_i^t} for each t, a set of parameters learned on the data only up to time t. This is because the user at time t cannot see any votes that arrive later than time t. In addition, while positive votes are mostly dominant in the end, the dominant mood up to time t could be negative, exactly when the user at time t + 1 tries to vote. In this case, h_i^{(t)} becomes −1, inverting the fraction to be the ratio of following the majority against the minority. By summarizing the learned parameters in terms of (τ, κ), we can compare selection and voting behaviors across different communities.
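Given per-vote predicted probabilities and majority indicators, κ is a geometric mean of (possibly inverted) odds ratios. A minimal sketch with hypothetical inputs (the event-list representation is our own simplification of the per-time-step retraining described above):

```python
import math

def conformity(events):
    """Conformity coefficient: geometric mean of odds ratios.

    events: (p_pos, majority_is_positive) pairs, where p_pos is the model's
    probability that the next vote is positive (learned only from data up
    to that time).  A negative current majority inverts the odds (h = -1),
    so the ratio always measures following the majority over the minority.
    """
    log_sum = 0.0
    for p_pos, majority_is_positive in events:
        h = 1 if majority_is_positive else -1
        log_sum += h * math.log(p_pos / (1.0 - p_pos))
    return math.exp(log_sum / len(events))

# If every vote faces a positive majority and the model predicts a positive
# vote with probability 0.6, kappa is the plain odds 0.6 / 0.4 = 1.5.
kappa = conformity([(0.6, True)] * 10)

# Against a negative majority, p_pos = 0.4 yields the same kappa = 1.5.
kappa_inv = conformity([(0.4, False)] * 4)
```

Working in log space keeps the geometric mean numerically stable for large n.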

4 Experiments

We evaluate the CVP on product reviews from Amazon and 82 issue-specific forums from the StackExchange network. The Amazon dataset [23] originally consisted of 595 products with daily snapshots of writing/voting trajectories from Oct 2012 to Mar 2013. After eliminating duplicate products¹¹ and products with fewer than five reviews or fragmented trajectories,¹² 363 products are left. For the StackExchange dataset, we filter out questions from each community with fewer than five answers besides the answer chosen by the question owner.¹³ We discard communities with fewer than 100 questions after pre-processing. Many of these are META forums where users discuss policies and logistics for their original forums. (While we exclude such forums for cleaner visualization, we verify that they are generally clustered around the META forums included

¹¹ Different seasons of the same TV shows have different ASIN codes but share the same reviews.

¹² If the number of total votes between the last snapshot of the earlier fragment and the first snapshot of the later fragment is less than 3, we fill in the missing information simply with the last snapshot of the earlier fragment.

¹³ The answer selected by the question owner is displayed first regardless of the voting scores.


Table 5: Predictive analysis on the first 50 votes: in the selection phase, the CVP shows better negative log-likelihood in almost all forums; in the voting phase, the full model shows better negative log-likelihood than all subsets of features. Quality analysis at the final snapshot: smaller residuals and bumpiness show that the order based on the estimated quality q_{ij} correlates more coherently with the average sentiments of the associated comments than the order by display rank. (OF = Overflow, SOF = StackOF, rest = Exchange; Blue: p ≤ 0.001, Green: p ≤ 0.01, Red: p ≤ 0.05)

Community        | Sel CRP | Sel CVP | q_ij | λ    | ν_i  | q_ij,λ | q_ij,ν_i | λ,ν_i | Full | Res Rank | Res Qual | Bump Rank | Bump Qual
SOF(22925)       | 2.152   | 1.989   | .107 | .103 | .108 | .100   | .106     | .100  | .096 | .005     | .003     | .080      | .038
math(6245)       | 1.841   | 1.876   | .071 | .064 | .067 | .062   | .066     | .060  | .059 | .014     | .008     | .280      | .139
english(5242)    | 1.969   | 1.924   | .160 | .146 | .152 | .141   | .147     | .137  | .135 | .018     | .007     | .285      | .149
mathOF(2255)     | 1.992   | 1.910   | .049 | .046 | .049 | .045   | .047     | .046  | .045 | .009     | .007     | .185      | .119
physics(1288)    | 1.824   | 1.801   | .174 | .155 | .166 | .150   | .156     | .146  | .142 | .032     | .014     | .497      | .273
stats(598)       | 1.889   | 1.822   | .051 | .044 | .048 | .043   | .046     | .042  | .042 | .030     | .019     | .613      | .347
judaism(504)     | 2.039   | 1.859   | .135 | .124 | .132 | .121   | .125     | .118  | .116 | .046     | .018     | .875      | .403
amazon(363)      | 2.597   | 2.261   | .266 | .270 | .262 | .254   | .243     | .253  | .240 | .023     | .016     | .392      | .345
meta.SOF(294)    | 1.411   | 1.575   | .261 | .241 | .270 | .229   | .243     | .232  | .225 | .018     | .013     | .281      | .255
cstheory(279)    | 1.893   | 1.795   | .052 | .040 | .053 | .039   | .049     | .039  | .038 | .032     | .029     | .485      | .553
cs(123)          | 1.825   | 1.780   | .128 | .100 | .118 | .099   | .113     | .097  | .096 | .069     | .040     | .725      | .673
linguistics(107) | 1.993   | 1.789   | .133 | .127 | .130 | .122   | .123     | .120  | .116 | .074     | .038     | .778      | .656
AVERAGE          | 2.050   | 1.945   | .109 | .103 | .108 | .099   | .105     | .098  | .095 | .011     | .006     | .186      | .101

in Figure 1, having low Trendiness and low Conformity.)

4.1 Predictive analysis

In each community, our prediction task is to learn the model up to time t and predict the action at t + 1. We align all items at their initial time steps and compute the average negative log-likelihood of the next actions under the current model. Since the complete trajectory enables us to separate the selection and voting phases in inference, we also measure the predictive power on these two tasks separately against their own baselines. For the selection phase, the baseline is the CRP, which selects responses with probability proportional to the number of accumulated votes or writes a new response with probability proportional to α.¹⁴ When t < 50, as shown in the first column of Table 5, the CVP significantly outperforms the CRP based on two-tailed paired t-tests. Using the function f based on display rank and the Trendiness parameter τ turns out to be a more precise representation of positional accessibility. Especially in the early stages, users often select responses displayed at lower ranks with fewer votes. While the CRP has no ability to assign high scores in these cases, the CVP properly models them by decreasing τ. The comparative advantage of the CVP declines as more votes become available and the correlation between display rank and the number of votes increases. For items with t ≥ 50, there is no significant difference between the two models, as exemplified in the third column of Figure 3. These results are consistent across other communities (p > 0.07).

^14 We fix α to 0.5 after fitting over a wide range of values.
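The CRP baseline described above can be sketched as follows; the function name is ours and the default α = 0.5 comes from footnote 14. An existing response with n_j accumulated votes is selected with probability proportional to n_j, and a new response is written with probability proportional to α:

```python
def crp_probs(vote_counts, alpha=0.5):
    """Selection probabilities under the Chinese Restaurant Process baseline:
    an existing response is chosen proportional to its accumulated votes,
    and a new response is written with probability proportional to alpha."""
    total = sum(vote_counts) + alpha
    probs = [n / total for n in vote_counts]   # existing responses
    probs.append(alpha / total)                # writing a new response
    return probs
```

For example, with three responses holding 10, 3, and 1 votes and α = 0.5, the most-voted response attracts probability 10/14.5 ≈ 0.69, illustrating the rich-get-richer dynamic that the CVP's rank-based function f is designed to correct.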



[Figure 2 screenshot: a StackOverflow question (112 votes) about piping a Python script into the Django shell, its first three answers (195, 78, and 62 votes), and the comments attached to the first two answers, in which users note that the second, lower-ranked answer is the correct one.]

Figure 2: A question and the first three answers from StackOverflow. The comments of the first two answers are highlighted in the blue boxes. By reading the sentiment of comments, we are able to presume the second answer is indeed better than the first one, but it is too late to fix the order. (https://stackoverflow.com/questions/16853649)

Improving predictive power on the voting phase is difficult because positive votes dominate in every community. We compare the fully parametrized model to simpler partial models in which certain parameters are set to zero. For example, a model with all parameters but λ knocked out is comparable to a plain Polya urn. As illustrated in the second column of Table 5, we verify that every sub-model is significantly different from the full model in all major communities based on a one-way ANOVA test, implying that each feature adds distinctive and meaningful information. Having the item-specific length bias νi provides significant improvements, as does having the intrinsic quality qij and the current-opinion parameter λ. While we omit the log-likelihood results with t ≥ 50, all models better predict the true polarity when t ≥ 50, because the log-linear model obtains a more robust estimate of community-level parameters as it acquires more training samples.
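The plain Polya urn comparison above can be sketched as follows; the function name and the pseudo-count priors a and b are illustrative assumptions, not the paper's exact parametrization:

```python
def polya_urn_prob_positive(pos, neg, a=1.0, b=1.0):
    """Probability that the next helpfulness vote is positive under a plain
    Polya urn: each cast vote is added back to the urn, so the current
    tally self-reinforces. a and b are hypothetical pseudo-count priors."""
    return (pos + a) / (pos + neg + a + b)
```

A response with 8 positive and 2 negative votes then yields P(positive) = 9/12 = 0.75 regardless of the response's content; the full CVP instead modulates this baseline with qij, νi, and λ.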

4.2 Quality analysis

The primary advantage of the CVP is its ability to learn an intrinsic quality for each response that filters out excessive momentum from self-reinforcing voting processes. We validate these scores by comparing them to another source of user feedback: both StackExchange and Amazon allow users to attach comments to responses along with votes. Figure 2 illustrates one exemplary question and its first three answers, with the associated comments highlighted. For each response,




Figure 3: Comment and likelihood analysis on the StackOverflow forum. The left panels show that responses with higher ranks tend to have more comments (top) and more positive sentiments (bottom). The middle panels show that responses have more comments at both high and low intrinsic quality qij (top). The corresponding sentiment correlates more cohesively with the quality score (bottom). Each blue dot is approximately an average over 1k responses, and we parse 337k comments given on 104k responses in total. The right panels show predictive power for the selection phase (top) and the voting phase (bottom) up to t < 50 (lower is better).

we record the number of comments and estimate the average sentiment of those comments.^15 As a baseline, we also calculate the final display rank of each response, which we convert to a z-score to make it more comparable to the quality scores qij. After sorting responses based on display rank and quality rank, we measure the association between the two rankings and comment sentiment with linear regression. Results are shown for StackOverflow in Figure 3. As expected, the responses with better display ranks have more comments, but we also find that the responses in intrinsic quality order have more comments at both high and low scores. While both a better display rank and a higher quality score qij are clearly associated with more positive comments (slope ∈ [0.47, 0.64]), the residuals of quality rank (0.012) are on average less than half the residuals of display rank (0.028). In addition, we also calculate the "bumpiness" of these plots by computing the mean variation of two consecutive slopes between each adjacent pair of data points. Quality rank reduces the bumpiness of display rank from 0.391 to 0.226 on average, implying that the CVP-estimated intrinsic quality yields a ranking that is locally as well as globally consistent.^16

^15 We use Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank by Socher et al.
^16 All numbers and p-values in the paragraphs are weighted averages over all 83 communities, whereas Table 5 only includes results for the major communities and their own weighted averages due to space limits.
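The bumpiness statistic described above, the mean variation between consecutive slopes, admits a direct sketch (the function name is ours):

```python
def bumpiness(xs, ys):
    """Mean absolute variation between consecutive slopes of adjacent data
    points: 0 for a perfectly straight trend, larger for jagged plots."""
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
              for i in range(len(xs) - 1)]
    variations = [abs(slopes[i + 1] - slopes[i])
                  for i in range(len(slopes) - 1)]
    return sum(variations) / len(variations)
```

A monotone straight trend scores 0, while a zigzag through (0,0), (1,1), (2,0) scores 2; on this scale, the reported drop from 0.391 (display rank) to 0.226 (quality rank) reflects a locally smoother curve.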




Figure 4: Sub-community embedding for StackOverflow.

4.3 Community analysis

The 2D embedding in Figure 1 shows that we can compare and contrast the different evaluation cultures of communities using two inferred behavioral coefficients: Trendiness τ and Conformity κ. Communities are sized according to the number of items and colored based on a manual clustering. Related communities collocate in the same neighborhood. Religion, scholarship, and meta-discussions cluster towards the bottom left, where users are interested in many different opinions and are happy to disagree with each other. Going from left to right, communities become more trendy: users in trendier communities tend to select and vote mostly on already highly-ranked responses. Going from bottom to top, users become increasingly likely to conform to the majority opinion on any given response. By comparing related communities we can observe that the characteristics of user communities determine voting behavior more than technical similarity. Highly theoretical and abstract communities (cstheory) have low Trendiness but high Conformity. More applied, but still graduate-level, communities in similar fields (cs, mathoverflow, stats) show less Conformity but greater Trendiness. Finally, more practical homework-oriented forums (physics, math) are even more trendy. In contrast, users in english are trendy but argumentative. Users in amazon are the most sensitive to trendy reviews and the least afraid of voicing minority opinions.

StackOverflow is by far the largest community, and it is reasonable to wonder whether the Trendiness parameter is simply a proxy for size. When we subdivide StackOverflow by programming languages, however (see Figure 4), individual community averages can be distinguished, but they all remain in the same region. Javascript programmers are more satisfied with trendy responses than those using c/c++. Mobile developers tend to be more conformist, while Perl hackers are more likely to argue. Due to such disparate voting behavior, each community would require different strategies to uncover under-rated responses.



5 Conclusions

Helpfulness voting is a powerful tool to evaluate diverse and abundant information such as reviews of products and answers to questions. However, such votes can be socially reinforced by positional accessibility and presentational factors such as existing votes or the relative length of a response. The CVP takes into account sequences of votes, assigning different weights based on the context in which each vote was cast, and thereby shows higher predictive power in estimating the next votes. Instead of modeling a mechanism-specific and strategically varying response-ordering function, we leverage the fully observed trajectories of votes, estimating the hidden intrinsic quality of each response, which produces a better ranking of responses than existing systems' orders. We also compare different levels of biases across communities by formulating two behavioral coefficients: Trendiness and Conformity. As we become more able to observe social interactions as they occur, rather than only summarized after the fact, we will increasingly be able to use models beyond exchangeability.

References

[1] Young Kwark, Jianqing Chen, and Srinivasan Raghunathan. Online product reviews: Implications for retailers and competing manufacturers. Information Systems Research, 25(1):93–110, 2014.

[2] Yubo Chen and Jinhong Xie. Third-party product review and firm marketing strategy. Marketing Science, 24(2):218–240, 2005.

[3] Weber Shandwick. Buy it, try it, rate it: Study of consumer electronics purchase decisions in the engagement era. KRC Research, 2012.

[4] Pei-Yu Chen, Samita Dhanasobhon, and Michael D. Smith. All reviews are not created equal: The disaggregate impact of reviews and reviewers at Amazon.com. 2008.

[5] Yla R. Tausczik, Aniket Kittur, and Robert E. Kraut. Collaborative problem solving: A study of MathOverflow. In Computer-Supported Cooperative Work and Social Computing, CSCW '14, 2014.

[6] Yandong Liu, Jiang Bian, and Eugene Agichtein. Predicting information seeker satisfaction in community question answering. In SIGIR, pages 483–490. ACM, 2008.

[7] Vassilis Kostakos. Is the crowd's wisdom biased? A quantitative analysis of three online communities. In CSE, volume 4, pages 251–255. IEEE, 2009.

[8] Eric K. Clemons, Guodong Gordon Gao, and Lorin M. Hitt. When online reviews meet hyperdifferentiation: A study of the craft beer industry. Journal of Management Information Systems, 2006.

[9] Jingjing Liu, Yunbo Cao, Chin-Yew Lin, Yalou Huang, and Ming Zhou. Low-quality product review detection in opinion summarization. In EMNLP-CoNLL '07, 2007.

[10] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. 2007.

[11] Cristian Danescu-Niculescu-Mizil, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. How opinions are received by online communities: A case study on Amazon.com helpfulness votes. In WWW, 2009.

[12] Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311:854–856, 2006.

[13] Matthew J. Salganik and Duncan J. Watts. Leading the herd astray: An experimental study of self-fulfilling prophecies in an artificial cultural market. Social Psychology Quarterly, 71:338–355, 2008.

[14] Yisong Yue, Rajan Patel, and Hein Roehrig. Beyond position bias: Examining result attractiveness as a source of presentation bias in clickthrough data. In WWW, 2010.

[15] Lev Muchnik, Sinan Aral, and Sean J. Taylor. Social influence bias: A randomized experiment. Science, 341(6146):647–651, 2013.

[16] Keith Burghardt, Emanuel F. Alsina, Michelle Girvan, William M. Rand, and Kristina Lerman. The myopia of crowds: A study of collective evaluation on Stack Exchange. PLOS, 2016.

[17] David J. Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour, pages 1–198. 1985.

[18] Soo-Min Kim, Patrick Pantel, Tim Chklovski, and Marco Pennacchiotti. Automatically assessing review helpfulness. In EMNLP, 2006.

[19] Anindya Ghose and Panagiotis G. Ipeirotis. Designing novel review ranking systems: Predicting the usefulness and impact of reviews. In ICEC, pages 303–310, 2007.

[20] Jahna Otterbacher. 'Helpfulness' in online communities: A measure of message quality. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '09, pages 955–964, 2009.

[21] Lionel Martin and Pearl Pu. Prediction of helpful reviews using emotion extraction. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI '14, pages 1551–1557, 2014.

[22] Stefan Siersdorfer, Sergiu Chelaru, Jose San Pedro, Ismail Sengor Altingovde, and Wolfgang Nejdl. Analyzing and mining comments and comment ratings on the social web. ACM Trans. Web, 2014.

[23] Ruben Sipos, Arpita Ghosh, and Thorsten Joachims. Was this review helpful to you? It depends! Context and voting patterns in online content. In WWW, 2014.

[24] David Blei, Thomas Griffiths, Michael Jordan, and Joshua Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, 2003.

[25] David M. Blei and Peter I. Frazier. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, pages 2461–2488, 2011.