Rate this transaction: Towards principles for the design of

market feedback systems

Gary E. Bolton¹, Alina Ferecatu², David J. Kusterer³

— Preliminary draft —

— Please do not circulate —

Abstract: Market feedback systems are critical to enforcing trader trustworthiness in many on- and

offline markets. In this paper, we aim to outline guidelines towards how feedback systems should

be designed. We conduct an incentive-aligned experiment mirroring an online market to

characterize buyer behavior, in terms of ratings given and use of feedback scores. We investigate

the friction in information transmission due to the choice of the feedback scale (3-point vs. 5-point

scale) and feedback elicitation question (general or directed feedback). We characterize the impact

of these feedback system design choices on system informativeness with an entropy model

forecasting sellers’ performance. Using a hidden Markov mixture of experts model, we investigate

whether buyers switch between giving feedback based on quality received, or based on their

profits. Our analysis suggests that feedback systems mapping the feedback request and elicitation

scale on the most diagnostic dimension of seller future performance are most informative.

Keywords: Market design, Recommendation systems, Online ratings, Entropy, Hidden Markov

Mixture of Experts models

¹ O.P. Jindal Chair Professor of Management Economics, Jindal School of Business, University of Texas at Dallas, [email protected]

² Assistant Professor of Marketing, Rotterdam School of Management, Erasmus University, [email protected]

³ University of Cologne, [email protected]

1 Introduction

Markets are increasingly designed, not simply evolved (Roth, 2002). Feedback systems are a designed

feature of many modern markets. These systems provide the markets with information on the past reliability

of trading partners, enabling the tit-for-tat strategies that enforce market trustworthiness. Yet little is known

about how feedback systems should be designed, while there is a good deal of evidence that there is room

for improvement. In this paper, we lay out a normative benchmark, involving simple Bayesian measures,

for gauging the success of a feedback system, and we take a first step towards establishing design principles

that might guide a trading system towards reaching this benchmark.

Figure 1 provides a bird’s eye view of how feedback systems integrate into the market. The elements of

the system in need of design are in the blue boxes: The message space used by traders to report a trade

experience, and the feedback aggregation mechanism that combines reports to obtain an overall rating for an

individual trader. Traders interface with the feedback system in two ways and these are shown in green

print. Traders provide the feedback information as input to the system. They use the outputted ratings to

make decisions on their willingness-to-trade with potential partners in the market.

Implementing a feedback system requires a number of design choices. What information should be elicited

from traders? What format should this information take? What metrics should be used to aggregate

feedback information into trader ratings? How should the ratings be presented to traders? Casual

empiricism suggests certain norms in current practice. Traders are asked a single open-ended question,

such as “Rate this transaction”. Feedback is typically given on a Likert-type scale with between 3 and 10

points, but binary feedback scales are used as well. Individual seller ratings

are aggregated by averaging, and the average is then posted to the marketplace as the trader’s overall feedback

rating. Additional information, some platform specific, is often collected and displayed as well, such as

individual text comments and breakouts of individual score histories, but the average feedback score tends

to be the headline rating, the rating information most prominently displayed and most easily accessed. Our

investigation focuses on the design of the headline rating part of the system. Other aspects of the system

are likely important as well, but given their prominence in virtually all systems, the headline numbers strike

us as the place to start a systematic investigation.

In formulating design principles, it is important to have a clear benchmark for the objective being served.

The objective of a market feedback system is to provide ratings that are informative in two senses. First,

that a trader’s rating represents an accurate picture of that trader’s past performance. A system that fails in

the first regard risks its credibility with traders. Second, that a trader’s ratings have forecast value for that

trader’s future performance. A system that fails in this regard will not be useful to enforce trust in the

marketplace. Note that the second criterion does not necessarily follow from the first. A system might

accurately reflect the past but be suboptimal with regard to informativeness about future trader behavior,

perhaps because what it measures are not the best forecast variables. Alternatively, the feedback system

could fail to provide incentives for trustworthy behavior, such that sellers start “milking” the reputation

they have built previously. Sound principles for the design of these systems, therefore, are principles that

optimize the informativeness of the system by focusing on the important forecast variables.

In the normatively best feedback system, traders would accurately report all relevant aspects of their trading

partner’s performance, the system would aggregate this information as forecast guidance that would then

be used optimally by traders in deciding on their interactions with future trading partners. In practice, there

are two kinds of challenges to achieving such a system. The first is institutional. The need to aggregate

information about a trader across her various interactions requires a certain amount of uniformity in

reporting, meaning constraints on the message space. In addition, there is ambiguity about what constitutes

the best aggregation technique. Design mistakes in either regard might lead to a loss of informativeness.

The second challenge is behavioral. The system relies on traders to report their experiences accurately.

And no matter how informative ratings might be in theory, trader mistakes made in application reduce the

use-value of the system. To be successful, feedback systems must be designed to induce traders to give

accurate and useful feedback.

Figure 1: A bird’s eye view of the integration of feedback systems in the marketplace

[Figure 1 labels: Decide willingness-to-trade; Submit feedback]

Previous studies show that traders with favorable feedback scores are more likely to trade and at relatively

favorable terms (for overviews of the literature, see Bajari and Hortacsu 2004 and Tadelis 2016). However,

there is a sizable literature to show that the feedback systems presently used in practice have a number of

problems. Feedback scores are compressed, making inferences about seller behavior more difficult (Gregg

and Scott 2006, Dellarocas and Wood 2008, Bauerly 2009, Bolton et al. forthcoming). The early research

on eBay’s feedback system where both buyers and sellers could rate each other showed that ninety-nine

percent of all feedback left was positive (Resnick et al. 2006, Kauffman and Wood 2006). However, there

is evidence that the share of dissatisfied traders is much larger: Dellarocas and Wood (2008) used structural

estimation and found that 21% of buyers and 14% of sellers had a bad experience in rare coin auctions on

eBay. One reason is that dissatisfied buyers tend to remain silent more often, or even report a positive rating

out of fear of a negative retaliatory rating (Bolton et al. 2013). Silent transactions are not reported in most

systems, although they carry information about the sellers as they are more likely to be transactions where

something went wrong. Nosko and Tadelis (2015) showed in a large-scale field study on eBay that buyer

satisfaction increases if search results are sorted in a way that also takes into account the number of silent

transactions a seller had.

These issues suggest there are problems with the current feedback systems, pertaining either to how

feedback ratings are elicited and what information is collected as a result, or to the way the information

is aggregated and presented to users. Yet at present there is little in the way of a model for how feedback

is given or used to guide us (see Golman and Bhatia 2012 for a model of lenient feedback giving). On the

institutional side, we lack clear benchmarks for establishing the effectiveness of these systems. We do not

know what the optimal rating metrics are or the optimal aggregation technique. On the behavioral side,

and closely related to the institutional questions, we do not understand how traders interact with these

systems at an elementary level: How do traders determine the rating that they give other traders? How do

they use feedback information to determine their willingness-to-trade and at what price? Absent answers

to these questions, it is hard to see how we can come up with answers to the visible problems mentioned

above.

We study a market stylized to capture an elementary trust problem that feedback systems aim to solve. The

setup is characterized by adverse selection with regard to the seller types. Buyers have incomplete

information about seller reliability. Specifically, sellers differ in the average quality of product they deliver,

although the quality in any transaction is noisy around the average associated with the sending seller type.

Prior to market open, the distribution of seller types is made common knowledge among buyers. After a

transaction is made, a buyer can give a numerical feedback rating to the seller. All the ratings a seller has

previously received are averaged, resulting in a seller feedback score that is shown to the next prospective

buyer matched with that seller. In essence, the feedback system is informative to the extent that the provided

feedback scores aid buyers in discovering a seller’s true market type prior to trading with that seller. The

nature of the quality of product in the market lends itself to a rather canonical mapping into feedback scores.

Assuming that buyers use this mapping, a straightforward Bayesian updating routine determines the

dynamically optimal informativeness of the system across periods of trade. We propose to measure the

informativeness by the entropy of the beliefs about the seller types, a concept adapted from information

theory.

We then implement this market in a laboratory setting to study the system’s actual informativeness. The

experimental market isolates human buyer behavior by ‘fixing’ sellers as robots that play the role of the noisy

sellers, a feature that is known to the buyers. We can then compare buyer behavior, in terms of feedback

giving and use of feedback scores, to the normative benchmarks. We vary two features of the institutional

structure of the feedback system. The first feature is the question used to prime buyers to give feedback.

Typically, the question is open ended, such as “Rate this transaction”, “Rate your experience”, or “Leave a

rating”. Our normative model suggests that these questions should be more directed. The reason is that

scores are more informative if all ratings correspond to a common mapping. So in addition to the baseline

treatment where we ask “Leave a rating”, we run treatments that ask a directed “Rate the quality you

received”. The second feature we manipulate is the scale for scoring. In many systems, it is a five-point

Likert scale (eBay’s DSR scoring was inspired by this, see Bolton et al. 2013). Our normative model

suggests the scale should be tailored to what the priming question directs buyers to rate. For the quality

in the experiment, a more detailed scoring range should be more informative than a less detailed one. Thus,

in addition to a 5-point scale, we consider a 3-point scale; the latter’s informativeness can be close to the

former’s if used with an appropriate mapping from quality to a rating, but decreases sharply otherwise.

The remainder of the paper is organized as follows. In Section 2, we describe the experimental design and the

normative model. In Section 3, we derive benchmarks for informativeness, show descriptive results,

and compare the empirical informativeness of the feedback system in the four treatments to the benchmarks.

In Section 4, we develop a Hidden Markov Mixture of Experts model to gain more insight into which features

the subjects base their ratings on, and to show why informativeness is lower than theoretically possible. The

final section concludes.

2 Experimental design and formal description of the model

2.1 Description of the experimental design

In our experimental market, human subjects in the role of buyers interact with computerized sellers offering

goods of unknown quality. Sellers differ in their type, that is, the distribution according to which they ship

different levels of quality. Buyers are informed that there are three types of sellers, called Type A (low

performers), Type B (medium performers), and Type C (high performers), and of their respective quality

distributions (see Figure 2). Hence, our markets are characterized by adverse selection.

Figure 2: Computerized sellers privately determine quality sent 𝑄 ∈ {20, 40, 60, 80, 100}. The distributions of quality

sent by seller type show the market accommodates low (A), medium (B), and high (C) performers.

The human buyers engage in repeated auctions to buy the goods offered by robot sellers over a series of

periods. In each period, one buyer is matched with exactly one robot seller and plays the stage game outlined

next (see also Figure 3).

In the auction stage, buyers learn their valuation for the good offered by the seller in the current period. It

is drawn randomly from a uniform distribution over the integers from 200 to 300. We use Experimental

Currency Units (ECU) as the monetary unit. Buyers receive feedback information about the matched seller: the

number of transactions and the seller’s average feedback score (equal to the average of all previous

feedback ratings, and empty if the seller did not yet receive any feedback). Buyers are asked to submit their

maximum willingness to pay for the good. We use a BDM mechanism (Becker et al., 1964) to elicit buyers’

true willingness to pay for the product. The price is drawn randomly from a custom distribution over the

integers between 0 and 300. The distribution places most of its mass on low prices to ensure a reasonable number of transactions

(see the next section for the exact specification). Note that buyers’ optimal strategy is to bid their true

willingness to pay, which is their valuation multiplied by the expected quality based on the belief about

the seller type.

[Figure 2 panels: Type A, Type B, Type C. Horizontal axis: level of shipped quality (20%, 40%, 60%, 80%, 100%). Vertical axis: probability (ticks at 10, 20, 40).]

Figure 3: The experimental platform: the auction and feedback stages

In the feedback stage, buyers first learn whether the transaction took place. A transaction occurs if the

random price is lower than the bid. If the good is sold, the buyer has to pay the random price, learns the

level of quality the robot seller shipped (20% to 100% quality), and his profit in the current period. Buyers’

profits are equal to their valuation multiplied by the received level of quality minus the random price if

the buyer wins the auction, and 0 otherwise. Second, the buyers are asked to leave a rating. Leaving

feedback is compulsory, to avoid any confounds related to buyer selection in feedback giving. After

submitting a rating, the period ends and the next one begins.

Our treatments are implemented on the feedback page. We use a full factorial design between feedback

scale and feedback elicitation question and thus have four treatments in total. Subjects are asked to leave

general feedback (“Leave a rating”) vs. directed feedback based on quality (“Rate the quality”). Subjects

can leave feedback on a 3-point scale (matched with seller type), or on a 5-point scale (matched with

quality).

The stage game described above is repeated for 60 periods. Since buyers’ bids depend on the feedback

scores given by other players, all game observations are inherently correlated through the feedback

structure. Therefore, in order to have independent observations, we create matching groups of 6 human

players at the beginning of the experiment and consider one matching group to be one independent

observation. Within one matching group there are 6 sellers, two of each of the three types A, B, and C.

Sellers and buyers are randomly matched within matching groups at the beginning of each period. We use

the same random matching routine in all treatments, to reduce noise, such that matching, buyers’ valuation,

random price, seller type and seller quality are constant across matching groups.

At the beginning of the experiment, before the first period starts, subjects are made familiar with the quality

distributions of the different seller types and the BDM mechanism. On a first simulation screen for the

quality distributions, they see three empty graphs, one for each seller type. By clicking a button below each

of the boxes, one level of quality is drawn according to the seller’s distribution and shown as a histogram

in the graph. With each new draw the histogram is updated. They are encouraged to click often until they

see that the empirical distribution might differ at first but eventually converges to the theoretical

distributions outlined in Figure 2. On the second simulation screen, they see the random price distribution

used for the BDM mechanism. They are asked to select a hypothetical maximum price they are willing to

pay and an actual bid. The price distribution graph then is updated to show the fact that bids above the

willingness to pay can lead to transactions with a loss, and bids below the willingness to pay can lead to

foregone profitable trades (see Appendix A for screenshots). We made it transparent to the subjects that the

optimal way to bid is to bid exactly the willingness to pay in order to make the bid a good proxy for expected

quality.

After the main experiment was over, subjects were asked to fill in a socio-demographic questionnaire,

psychometric scales such as risk aversion (the bomb risk elicitation task, Crosetto et al. 2013), numeracy

(the Berlin Numeracy Test, Cokely et al. 2012), and answer a question regarding their feedback giving habit

on online platforms in real-life transactions (1-almost never to 5-almost always). They are also asked to

provide their reasoning on how they gave feedback ratings in the experiment. Once the experiment is over,

their earnings from the main game plus an initial endowment of 200 ECU are converted to euro (320 ECU

= 1 euro). The earnings from the main game, the risk aversion and numeracy tasks and a show-up fee of 4

euro are added and paid out to participants in cash.

We implemented and ran our experiment using SoPHIE, a web-based experimental software. We ran 8

experimental sessions in a physical lab; each session lasted about 90 minutes. One treatment (2 sessions)

was run in February 2018 and the remaining three treatments were run in June 2018, using a subject pool

from a large western European university. Subjects in this pool are accustomed to participating in incentive

aligned experiments involving no deception such as ours. 6 (2) sessions had 30 (24) buyers participate in

the auction against robot sellers, with 5 (4) matching groups per session (the number was lower in two

sessions due to no-shows).

2.2 Description of the formal model

In this section we describe the experiment again in terms of a formal model for the informativeness analysis

in the following section. Figure 4 gives an overview of the model. In each period $t \in \{1, 2, \ldots, 60\}$, bidders are randomly matched with a seller. The seller’s type $\theta$ is drawn from $\Theta = \{\theta_1, \theta_2, \theta_3\}$, i.e., there are good, average, and bad sellers. There is an equal number of each seller type such that buyers have a uniform prior $\mu(\theta) = \frac{1}{3}$ over $\Theta$. A seller ships quality from the set $Q = \{1, \ldots, 5\}$ denoting the five quality levels between 20% and 100% in increments of 20 percentage points. Each seller ships quality with probabilities $p_\theta = (p_\theta^1, \ldots, p_\theta^5)$. As noted before, the good seller has shipping probabilities $p_{\theta_1} = (0.1, 0.1, 0.2, 0.2, 0.4)$, the average type ships quality according to $p_{\theta_2} = (0.1, 0.2, 0.4, 0.2, 0.1)$, and the bad type is defined by the shipping probabilities $p_{\theta_3} = (0.4, 0.2, 0.2, 0.1, 0.1)$. Denote the average quality shipped by each seller type by $\bar{q}_\theta$.
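As an illustrative check on the average quality shipped by each seller type, the following minimal Python sketch computes the expected quality from the shipping probabilities stated above. The names `QUALITY_PCT`, `SHIP_PROBS`, and `mean_quality` are hypothetical and chosen for illustration only; they are not taken from the experimental software.

```python
# Quality levels 1..5 correspond to 20%, 40%, ..., 100% (as in the text).
QUALITY_PCT = [20, 40, 60, 80, 100]

# Shipping probabilities per seller type, copied from the text.
SHIP_PROBS = {
    "good": [0.1, 0.1, 0.2, 0.2, 0.4],
    "average": [0.1, 0.2, 0.4, 0.2, 0.1],
    "bad": [0.4, 0.2, 0.2, 0.1, 0.1],
}


def mean_quality(probs):
    """Expected shipped quality (in percent) for one seller type."""
    return sum(p * q for p, q in zip(probs, QUALITY_PCT))


MEAN_QUALITY = {name: mean_quality(p) for name, p in SHIP_PROBS.items()}

for name, value in MEAN_QUALITY.items():
    print(f"{name}: {value:.1f}%")
```

Under these probabilities the three averages work out to 74%, 60%, and 46% for the good, average, and bad type, respectively, so the types are separated by 14 percentage points of expected quality.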

The platform collects information on the sellers and consists of a set of admissible feedback ratings, a

feedback rating aggregation function, and a feedback information display. In the experiment, ratings are

allowed in the set of integers from 1 to $k$, denoted by $R_k$, and $k$ is either 5 or 3. The feedback aggregation function $f(r) = \frac{1}{n} \sum_{i=1}^{n} r_i$ is the average of all ratings received for a particular seller up to the current period, $(r_1, \ldots, r_n)$, where $r \in R_k$ and $n$ is the number of ratings received. The feedback information display $D_t = (f(r), n)$ consists of the average rating, called the feedback score, and the number of ratings received, and is shown to the bidder in period $t$.

After observing the realization of quality 𝑞 in a given period from a particular seller, buyers submit a

feedback rating 𝑟 to the platform. This rating is determined by a mapping from quality to rating using the

stochastic matrix

$$M_k = \begin{pmatrix} p_{1,1} & \cdots & p_{1,k} \\ \vdots & \ddots & \vdots \\ p_{5,1} & \cdots & p_{5,k} \end{pmatrix}, \tag{1}$$

where each row is the distribution of ratings conditional on quality 𝑞 ∈ 𝑄. Hence, the mapping from quality

into ratings induces a random variable 𝑅. For each seller, the probability distribution over 𝑅 is given by

$\rho_\theta = (\rho_\theta^1, \ldots, \rho_\theta^k)$, where $\rho_\theta^i = \sum_{j=1}^{5} p_\theta^j \, p_{j,i}$ for seller type $\theta \in \Theta$. The mapping $M_k$ also induces a distribution of the feedback information display. Observe that the elements of $D$ can be multiplied to give the sum of ratings $z = f(r)\,n$. The distribution of a sum of (identical) random variables can be obtained using generating functions (see Wilf 1994). In our setup, we can use the polynomial $G(x) = \left( \sum_{i=1}^{k} \rho_\theta^i x^i \right)^n$, where the coefficient on $x^z$ is the probability of observing a sum of $z$ after $n$ ratings.⁴ Hence we have $\zeta_{\theta,n} = (\zeta_{\theta,n}^{n}, \zeta_{\theta,n}^{n+1}, \ldots, \zeta_{\theta,n}^{nk})$ as the distribution of the sum of ratings for a given seller type $\theta$ after $n$ ratings have been received.
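The two steps above (mapping quality into a rating distribution, then obtaining the distribution of the rating sum from the generating function) can be sketched in Python as follows. Expanding the polynomial is done here by repeated convolution of coefficient lists, which is one straightforward way to read off the coefficients and not necessarily how the authors computed them. The function names and the example mapping are assumptions made for illustration, and ratings are treated as independent draws.

```python
def rating_distribution(ship_probs, mapping):
    """Return rho, where rho[i] = sum_j ship_probs[j] * mapping[j][i].

    Row j of `mapping` is the distribution over ratings given quality level j.
    """
    n_ratings = len(mapping[0])
    return [
        sum(ship_probs[j] * mapping[j][i] for j in range(len(ship_probs)))
        for i in range(n_ratings)
    ]


def sum_distribution(rho, n):
    """Distribution of the sum of n independent ratings drawn from rho.

    Entry s of the result is the probability that the sum equals s + n,
    so the list covers the sums n, n + 1, ..., n * k.
    """
    dist = [1.0]  # polynomial "1": no ratings received yet
    for _ in range(n):
        new = [0.0] * (len(dist) + len(rho) - 1)
        for a, pa in enumerate(dist):
            for b, pb in enumerate(rho):
                new[a + b] += pa * pb  # multiply the two polynomials
        dist = new
    return dist


# Example: the good seller type under a one-to-one mapping on the 5-point scale.
GOOD = [0.1, 0.1, 0.2, 0.2, 0.4]
IDENTITY_5 = [[1.0 if i == j else 0.0 for i in range(5)] for j in range(5)]

rho_good = rating_distribution(GOOD, IDENTITY_5)
zeta_good_2 = sum_distribution(rho_good, 2)  # sums 2, 3, ..., 10
print(zeta_good_2)
```

With two ratings on the 5-point scale the result has nine entries, one for each possible sum from 2 to 10.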

The platform’s choices of message space $R_k$, feedback aggregation function $f(r)$, and feedback information display $D$, together with the mapping matrix $M_k$ chosen by the buyers, define an information structure $S_k = (R_k, f(r), D, M_k)$ determining the updating process about the seller types. The information structure induces the set of observable signals $Z$ after a seller received $n$ ratings together with the associated probabilities $\zeta_n = (\zeta_n^{n}, \zeta_n^{n+1}, \ldots, \zeta_n^{nk})$ over the signals, where $\zeta_n^{i} = \sum_{\theta \in \Theta} \mu(\theta)\, \zeta_{\theta,n}^{i}$.

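In each period, bidders form beliefs $\beta(\theta|Z_n)$ about the seller types based on the feedback information display provided by the platform. For each seller type $\theta$, the posterior belief is given by

$$\beta(\theta|Z_n) = \frac{\mu(\theta)\, \zeta_{\theta,n}^{z}}{\sum_{\theta' \in \Theta} \mu(\theta')\, \zeta_{\theta',n}^{z}}. \tag{2}$$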
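To make the updating step concrete, the following Python sketch applies Bayes' rule to an observed rating sum. It assumes, for illustration only, that buyers rate on the 5-point scale with a one-to-one mapping, so that each type's rating distribution equals its shipping distribution. The helper names are hypothetical.

```python
# Rating distributions per seller type. Assumption: one-to-one mapping from
# quality to a 5-point rating, so these equal the shipping probabilities.
RATING_DISTS = {
    "good": [0.1, 0.1, 0.2, 0.2, 0.4],
    "average": [0.1, 0.2, 0.4, 0.2, 0.1],
    "bad": [0.4, 0.2, 0.2, 0.1, 0.1],
}


def sum_distribution(rho, n):
    """Distribution of the sum of n independent ratings; entry s <-> sum s + n."""
    dist = [1.0]
    for _ in range(n):
        new = [0.0] * (len(dist) + len(rho) - 1)
        for a, pa in enumerate(dist):
            for b, pb in enumerate(rho):
                new[a + b] += pa * pb
        dist = new
    return dist


def posterior(z, n, rating_dists, prior=None):
    """Posterior belief over seller types given rating sum z after n ratings."""
    types = list(rating_dists)
    if prior is None:
        prior = {t: 1.0 / len(types) for t in types}  # uniform prior
    k = len(rating_dists[types[0]])
    if not n <= z <= n * k:
        raise ValueError("rating sum z must lie between n and n * k")
    # Numerator of Bayes' rule: prior times likelihood of the observed sum.
    weights = {
        t: prior[t] * sum_distribution(rating_dists[t], n)[z - n] for t in types
    }
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


beliefs = posterior(z=5, n=1, rating_dists=RATING_DISTS)
print(beliefs)
```

In this example a single top rating moves the belief that the seller is the good type from 1/3 to 2/3, because the good type ships the highest quality four times as often as either of the other types.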

Based on their beliefs, bidders submit an integer-valued bid $b \in [0, 300]$ to a BDM mechanism where the price $p$ is drawn from a mixture distribution with CDF $F(x) = \sum_{i=1}^{4} w_i P_i(x)$, where $P_1(x) \sim U(0, 60)$, $P_2(x) \sim U(61, 120)$, $P_3(x) \sim U(121, 200)$, $P_4(x) \sim U(201, 300)$, $w_1 = 0.6$, $w_2 = 0.2$, $w_3 = 0.15$, and $w_4 = 0.05$. The expected profit of a bidder is $F(b)\left(E_\theta[Q|Z_t]\, v_i - E[p \mid b > p]\right)$, where $v_i$ is the bidder’s valuation for the good at full quality. The expected quality $E_\theta[Q|Z_t]$ is given by $\sum_{\theta \in \Theta} \beta_\theta(Z_t)\, \bar{q}_\theta$, the belief-weighted average of the expected qualities for the three seller types.
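The incentive property of the BDM mechanism can be illustrated with a short Python sketch that evaluates the expected profit of every admissible bid. It assumes that each mixture component is uniform over the integers in its range and that a trade occurs only when the random price is strictly below the bid, as described in the experimental design. The names `COMPONENTS`, `expected_profit`, and `best_bid` are hypothetical.

```python
# Mixture components as (lowest price, highest price, weight), from the text.
# Assumption: each component is uniform over the integers in its range.
COMPONENTS = [(0, 60, 0.6), (61, 120, 0.2), (121, 200, 0.15), (201, 300, 0.05)]


def price_pmf():
    """Probability mass function of the random price over the integers 0..300."""
    pmf = {}
    for low, high, weight in COMPONENTS:
        count = high - low + 1
        for price in range(low, high + 1):
            pmf[price] = weight / count
    return pmf


PMF = price_pmf()


def expected_profit(bid, wtp):
    """Expected profit of `bid` when the good is worth `wtp` in expectation.

    `wtp` stands for valuation times expected quality. A trade happens only
    if the random price is strictly below the bid.
    """
    return sum(prob * (wtp - price) for price, prob in PMF.items() if price < bid)


def best_bid(wtp):
    """Smallest integer bid in 0..300 that maximizes expected profit."""
    return max(range(0, 301), key=lambda bid: expected_profit(bid, wtp))


print(best_bid(150), best_bid(185.3))
```

In this sketch the profit-maximizing integer bid is the willingness to pay, rounded up when it is not an integer because a trade requires a price strictly below the bid. Bidding higher adds trades at a loss, and bidding lower forgoes profitable trades.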

⁴ The probabilities can be computed explicitly as $\frac{1}{z!} G^{(z)}(0)$, where $G^{(z)}(0)$ is the $z$-th derivative of $G$ at $x = 0$.

Figure 4. Overview of the formal model

3 Analysis

3.1 Informative feedback mappings

In formulating design principles, it is important to have a clear benchmark for the objective being served.

The objective of a market feedback system is to provide ratings that are informative, in the sense that a

trader’s ratings have forecast value for that trader’s future performance. We investigate the informativeness

of our four feedback systems by testing the extent to which a platform can predict seller types given the sellers’

feedback scores.

In this section, we describe the informative feedback mapping from quality to ratings for our two rating

systems $R_3$ and $R_5$. We use concepts from information theory to measure the informativeness of the rating

systems about the seller types (see Cover and Thomas, 2006, for an overview of information theory and

Veldkamp, 2011, for an overview of how informativeness is measured in economics and finance). The

uncertainty of a random variable $X$ with probability mass function $p(x)$ is measured by its entropy $H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$.⁵ Entropy is 0 if $p(x_i) = 1$ for exactly one outcome $i$ and $p(x_j) = 0$ for all other outcomes, and it is at its maximum when all $p(x)$ are equal. In our setup, the entropy of the prior beliefs about the seller types is $H(\Theta) = -\sum_{\theta \in \Theta} \mu(\theta) \log_2 \mu(\theta) \approx 1.58$, which is the maximum entropy of a random variable with three outcomes. When the beliefs are updated due to new information about the seller’s type in the form of the feedback score, the entropy decreases. In this sense, when beliefs about the seller type converge to the true type, entropy goes to 0.

[Figure 4 labels: Platform, Buyers, Seller; message space $R_k$; feedback aggregation rule $f(r)$; feedback information display $D$; mapping $M_k$; belief updating $\beta_\theta(D)$; seller types $\Theta$; bid $b$; quality $q$; rating $r$; feedback score]
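As a small numerical illustration of the entropy measure, the following Python sketch evaluates it for the uniform prior over three seller types and for a degenerate belief. The function name `entropy` is chosen for illustration, and terms with zero probability are skipped in line with the usual convention that 0 log 0 = 0.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


prior_entropy = entropy([1 / 3, 1 / 3, 1 / 3])   # uniform prior over three types
certain_entropy = entropy([1.0, 0.0, 0.0])       # belief concentrated on one type

print(round(prior_entropy, 2), certain_entropy == 0)
```

The uniform prior yields log2(3), roughly 1.58 bits, and a belief that puts all mass on one type yields 0 bits.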

In information theory, the concept of relative entropy, or the Kullback-Leibler distance, is used to measure

the difference between two probability distributions $p(x)$ and $q(x)$ of the same random variable $X$.⁶ It is defined as $d(p\|q) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)}$. Relative entropy “is a measure of the inefficiency of assuming that the distribution is $q$ when the true distribution is $p$” (Cover and Thomas, 2006, p. 19). We use the expected relative entropy over the possible signals

$$E_{Z|S_k}\left[ d\left( \beta(\theta|Z_t) \,\|\, \mu(\theta) \right) \right] = \sum_{z \in Z|S_k} \zeta_n^{z} \sum_{\theta \in \Theta} \beta(\theta|z) \log_2 \frac{\beta(\theta|z)}{\mu(\theta)} \tag{3}$$

to measure the expected informativeness of an information structure $S_k$, i.e., the expected improvement of

using the updated beliefs after observing the feedback score compared to using the prior beliefs about the

seller type. The expected relative entropy is also called the mutual information between the signal and the

seller type.
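The expected relative entropy can be computed directly from the objects defined in Section 2.2. The Python sketch below does so for a given mapping matrix and number of ratings, under the assumptions that ratings are independent draws and that the prior is uniform. All function names are hypothetical, and the two example mappings (a one-to-one mapping on the 5-point scale, and a 3-point mapping that sends 20% to rating 1, 40% to 80% to rating 2, and 100% to rating 3) are chosen for illustration.

```python
import math

SHIP_PROBS = {
    "good": [0.1, 0.1, 0.2, 0.2, 0.4],
    "average": [0.1, 0.2, 0.4, 0.2, 0.1],
    "bad": [0.4, 0.2, 0.2, 0.1, 0.1],
}

IDENTITY_5 = [[1.0 if i == j else 0.0 for i in range(5)] for j in range(5)]
COARSE_3 = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]


def rating_distribution(ship_probs, mapping):
    """rho[i] = sum_j ship_probs[j] * mapping[j][i]."""
    return [
        sum(ship_probs[j] * mapping[j][i] for j in range(len(ship_probs)))
        for i in range(len(mapping[0]))
    ]


def sum_distribution(rho, n):
    """Distribution of the sum of n independent ratings; entry s <-> sum s + n."""
    dist = [1.0]
    for _ in range(n):
        new = [0.0] * (len(dist) + len(rho) - 1)
        for a, pa in enumerate(dist):
            for b, pb in enumerate(rho):
                new[a + b] += pa * pb
        dist = new
    return dist


def expected_relative_entropy(mapping, n):
    """Expected KL divergence (in bits) of the posterior from a uniform prior."""
    types = list(SHIP_PROBS)
    prior = 1.0 / len(types)
    sum_dists = {
        t: sum_distribution(rating_distribution(SHIP_PROBS[t], mapping), n)
        for t in types
    }
    total = 0.0
    for s in range(len(sum_dists[types[0]])):
        marginal = sum(prior * sum_dists[t][s] for t in types)  # P(signal)
        if marginal == 0:
            continue
        for t in types:
            joint = prior * sum_dists[t][s]  # P(signal) * posterior
            if joint > 0:
                total += joint * math.log2((joint / marginal) / prior)
    return total


for n in (1, 10, 30):
    print(n, expected_relative_entropy(IDENTITY_5, n),
          expected_relative_entropy(COARSE_3, n))
```

The value is bounded above by the prior entropy of log2(3) bits. With a single rating, the 3-point mapping cannot be more informative than the one-to-one mapping, because the coarse rating is a function of the quality level.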

We compute the expected relative entropy explicitly for some mapping matrices as an informativeness

benchmark. We consider only mappings where the rating is weakly increasing in quality. Figure 5 shows

the expected relative entropy conditional on the number of feedback ratings received. For the rating system

$R_5$, using the identity matrix $M_5 = I_5$ is an optimal mapping, as each quality level is submitted to the

platform without noise.⁷ In the rating system $R_3$ this is no longer possible as there are fewer ratings available

than there are quality levels. We benchmark the following mappings:⁸

⁵ By convention, $0 \log 0 = 0$, which is justified by a continuity argument as $x \log x \to 0$ as $x \to 0$. It is common to use the base 2 logarithm such that entropy is measured in bits. Other bases would only change the unit of measurement.

⁶ Again, by convention, $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{q} = 0$, and $p \log \frac{p}{0} = \infty$. Using the natural logarithm, relative entropy corresponds to the expected value of the logarithm of the likelihood ratio.

⁷ The column-reversed identity matrix with 1 on the counterdiagonal and 0 otherwise would also be an optimally informative mapping in our setup.

⁸ Due to the symmetry in our sellers, the column-reversed mappings of these matrices lead to the same expected relative entropies.

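$$M_3^{2} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad M_3^{3} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad M_3^{�} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad M_3^{�} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$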

As can be seen in Figure 5, the matrix $M_3^{2}$ has the largest expected relative entropy of the mappings under

consideration for the numbers of received ratings that can be observed in our experiment. Hence, we consider

the identity matrix $M_5$ and the matrix $M_3^{2}$ as the optimal benchmark mappings for the analysis of feedback

system informativeness in Section 3.3.

Figure 5: Expected relative entropy for the weakly increasing mappings described in the text, conditional

on the number of feedback ratings received.

3.2 Descriptive summary and data patterns

Table 1 gives an overview of the results in our four treatments. The transaction rates and average earnings

are higher in the treatments with the $R_5$ rating system. Within each rating system, transaction rates and earnings

are higher in the treatment with the unspecific feedback elicitation question.

Treatment                                Observations   Subjects   Transaction rate   Average earnings
3-point scale, rate quality (R3-Q)       3600           60         70.2%              19.62€
3-point scale, rate transaction (R3-U)   3240           54         72.5%              19.95€
5-point scale, rate quality (R5-Q)       3600           60         73.1%              20.12€
5-point scale, rate transaction (R5-U)   3240           54         75.2%              20.17€

Table 1: Summary statistics across experimental treatments.

Figure 6a: The evolution of feedback scores for each of the 6 sellers across the 60 periods, and the 6 matching groups

(dashed lines). Feedback scores were normalized on the quality scale [20%, 100%], to be comparable across the 4

treatments. The golden line represents seller’s expected quality given the theoretical distribution presented at the

beginning of the experiment, and the thick grey line represents feedback score (the average across all feedback ratings).

Figure 6b: Feedback score distributions at the last period across the four treatments.

Figure 6a shows the average feedback scores of the six sellers across the 60 periods. The “5-point, rate

quality” treatment exhibits the least variation in feedback scores, allowing subjects to make more precise

inferences about the seller types, the key to solving the adverse selection problem inherent in this market.

Figure 6b suggests that at period 60, subjects are able to best distinguish between the three types of sellers

in the “5-point, rate quality” treatment as the feedback scores are more separated.

Figure 7: Rating distribution conditional on quality: feedback ratings given by buyers after receiving a good with a

certain quality level.

Going back to the question presented in the introduction, although the feedback score seems to closely

follow quality received, there is not an exact one to one mapping between quality received and the feedback

rating given by buyers (see figure 7). In the treatments with the 𝑅= rating system, the most common rating

for each level of quality is indeed the one prescribed by the matrix 𝑀=, but there are considerable deviations,

the more so in the R5 – U treatment. In the treatments where there are only three rating levels available, the

ratings for 20%, 60%, and 100% of quality are most often 1, 2, and 3, respectively, while there is again

more variance in the treatment with the general feedback elicitation question. Quality of 80% is most often

rated with a 3, although there is a considerable amount of ratings of 2. The largest difference between the

two R3 treatments occurs at a quality level of 40%: In R3 – Q, the most frequent rating is a 2, while it is a

3 in the R3 – U treatment. The mapping in R3 – U resembles the matrix 𝑀v3, while the mapping in R3 – Q

is more of a mix between matrix 𝑀v3 and 𝑀v

�.

This raises the question of identifying the determinants of buyer feedback giving behavior, which we tackle

in Section 4.


3.3 Feedback system informativeness

We use two approaches to compare the informativeness of the ratings in the four treatments. The first

approach is based on the expected relative entropy given the empirical mappings from the four treatments

(see Figure 8). This approach is tied closely to the theoretical analysis above. We compute for each

treatment and for each possible number of ratings the expected relative entropy for the mapping matrices

the subjects used in the experiment. Two observations can be made. First, within the R5 and the R3

treatments, the mappings used in the treatments where the feedback elicitation question refers to quality

exhibit higher informativeness. Second, the informativeness is lower in the R3 compared to the R5

treatments.

Figure 8: Expected relative entropy for the optimally informative mapping and the mappings used by subjects in the

experiment, conditional on the number of feedback ratings received.

The second approach is based on a multinomial logit model where the probability of each seller type is

predicted by the seller’s feedback score, while controlling for the number of ratings a seller received, the game

period, and an interaction term between the period and feedback score. To characterize the amount of

information loss induced by each feedback system, we compute the relative entropy of the predicted

probabilities for the three seller types.

At every period t, we compute the relative entropy per treatment, and per transaction opportunity, where

$y_i \in \{A, B, C\}$ are the three seller types at every transaction opportunity, and $x_i$ are the independent variables

mentioned above, including an intercept. The data include the entire history of seller feedback scores up to

period t. We then average the relative entropy across transactions to have an aggregate measure per period,


which we plot in Figure 9 below. Intuitively, the larger the relative entropy, the easier it is to distinguish

between the three seller types based on their feedback scores, making the system more informative. Both

approaches yield the same informativeness ranking of the four treatments.
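The averaging step described above can be sketched as follows. This is a minimal illustration, not the estimation code used in the paper: the predicted probabilities are made-up numbers standing in for multinomial logit output, the prior over seller types is assumed to be uniform, and the logarithm base (bits) is also an assumption.

```python
import math

def relative_entropy(posterior, prior):
    """Relative entropy D(posterior || prior) in bits.

    Terms with zero posterior mass contribute nothing to the sum.
    """
    return sum(p * math.log2(p / q) for p, q in zip(posterior, prior) if p > 0)

# Hypothetical predicted probabilities of seller types (A, B, C) for three
# transaction opportunities in one period. These are illustrative values only.
predicted = [
    [0.80, 0.15, 0.05],
    [0.40, 0.35, 0.25],
    [1 / 3, 1 / 3, 1 / 3],
]

# Assumption: a uniform prior over the three seller types.
prior = [1 / 3, 1 / 3, 1 / 3]

# Relative entropy per transaction opportunity, then the per-period average.
per_transaction = [relative_entropy(p, prior) for p in predicted]
period_average = sum(per_transaction) / len(per_transaction)
```

A prediction identical to the prior yields zero relative entropy, while a sharply peaked prediction yields a large value, which is why larger values indicate a more informative feedback score.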

Figure 9: Relative entropy per treatment

The most informative system, i.e. the system leading to the largest relative entropy, is the “5-point, rate quality”

treatment. This supports our suggestion to design a feedback system with a clear mapping between the

feedback rating scale and the feedback elicitation question, linked to the dimension deemed most

informative for the feedback system, which in our case is quality. Remarkably, the feedback scores in the

R3 – Q treatment are the second-most informative. We see that as long as the feedback elicitation is directed

towards the informative signal, quality in our setup, a coarser scale is not problematic per se. However, if

the feedback elicitation question is unspecific, informativeness drops in both rating systems.

Still, as seen in Figure 8, the informativeness in all four treatments is lower than theoretically possible. The

reason is the noisy feedback giving that does not follow the optimal mappings outlined in the section above.

In the next section, we analyze the determinants of buyers’ feedback giving to better understand why and

how feedback giving diverges from the most informative behavior.


4 Understanding the drivers of buyers’ behavior

4.1 A hidden Markov mixture of experts model to unravel buyers’ unobserved feedback giving

rules

Feedback systems are believed to function in an equilibrium state where all players coordinate on the

feedback rule. But equilibrium is the end state reached via an iterative process that unfolds over time. We

aim to uncover the underlying feedback rules used by the players, the evolution of rule use and the degree

of state dependence. Studies have focused on the drivers of feedback giving rather than on the rules buyers

use to leave feedback (Gutt et al. 2018). The idea of rule switching has been documented in learning

behavior (Salmon 2004, Stahl 2001, Ansari et al. 2012).

The descriptives presented in the above section suggest that quality is a major determinant of the level of

feedback rating. However, other factors also seem to play a role in feedback giving. In the feedback stage

of our experiment, buyers are informed about their profit from the transaction, in addition to quality

received, the seller feedback score and the transaction outcome. Therefore, subjects might use a profit-

based rule to leave feedback. For the same level of quality received, players might give a lower rating if

their profit is low or even negative, or a higher rating if their profit is large. Given that our experiment

involved multiple periods, any attempt to model feedback giving rules will need to account for the

possibility that subjects switch between feedback rules throughout the 60 periods of the experiment,

assuming multiple feedback rules exist. The dynamics in feedback rule adoption can depend on the

feedback system design, and on players’ experience in previous rounds.

The feedback rule used by a player in a certain round is not observable. We can only observe the rating

players leave. We implement a non-homogeneous hidden Markov mixture of experts model (NH-HMME)

to reveal players’ transitions between leaving feedback based on profit vs. based on quality received over

the periods of the game. In our context, the “experts” are the two different feedback rules, the quality vs.

the profit rule. Players leave ratings that are conditional on, and thus probabilistically aligned with, the feedback

rule chosen at every period. The approach is popular in machine learning and has been

introduced in economics to study learning rules dynamics. Ansari et al. (2012) studied how subjects switch

between reinforcement and belief learning in six games. A hidden Markov model (HMM) is more

commonly used to unravel latent dynamics in behavior. Instead of considering that the choice of rating is

driven by the profit rule or the quality rule, an HMM would assume that the conditional ratings distributions

come from the same model, i.e. the same family of distributions, but with different parameters. In our setup,

such a model is not identifiable because quality and profit are highly correlated (0.864 across all treatments),

which would lead to a highly collinear model were we to introduce them as explanatory variables in the

same model. Therefore, we propose a model where the two are mutually exclusive at each period, and

ratings can be consistent with a quality-based feedback rule or profit-based feedback rule. Subjects can

switch between feedback rules after every period of the auction, following a first order Markov decision

process governed by a full transition matrix.

The non-homogeneous nature of the model allows for time-varying transition propensities between

feedback rules, likely influenced by experimental outcomes. For instance, disappointing outcomes such as

a loss (negative profit), in the previous period of the auction can motivate subjects to switch states, and

change their feedback giving rule.

Figure 10: The hidden Markov mixture of experts model of feedback giving rules

Model specification: The HMME model has 3 components:

1. Start probabilities: the initial state membership probability, $\pi_i$, showing player i’s probability to give feedback based on the profit vs. quality rule s, in period 1.

2. (Non-homogeneous) transition probabilities $\Lambda_{it}$: the likelihood of transitioning between feedback rules s in period t. These probabilities are time varying, i.e. non-homogeneous, and depend on the game outcomes in the previous period t-1.

3. Emission probabilities $\Pr_{it}(Y_{it} = j \mid S_{it} = s)$: the feedback rating choice probabilities conditional on the state (feedback rule) chosen by the subject.

Initial state probabilities: Let 𝑠 = 𝑄, 𝑃 denote a latent feedback rule state, where 𝑠 = 𝑄 if the subject gives

feedback according to the quality rule, and 𝑠 = 𝑃 if the subject uses the profit rule. The probability that a

player is initially in state s, i.e. the initial state membership, is given by the following simplex:

$$\pi = (\pi_Q, \pi_P), \quad \text{s.t. } \sum_{s} \pi_s = 1; \qquad \pi_Q = \frac{1}{1 + \exp(-\gamma)} \tag{4}$$

We reparametrize the start probabilities using an inverse-logit transformation.

Transition probabilities: The transition probabilities show how subjects transition between feedback states at period t:

$$\Lambda_{it} = \begin{pmatrix} \lambda_{itQQ} & \lambda_{itQP} = 1 - \lambda_{itQQ} \\ \lambda_{itPQ} = 1 - \lambda_{itPP} & \lambda_{itPP} \end{pmatrix} \tag{5}$$

where $\sum_{s} \lambda_{its's} = 1$. Each element $\lambda_{its's}$ of the matrix shows the probability that a player transitions from

feedback rule s’ at period t-1 to feedback rule s in period t. The players’ probability to transition from one

state to another is impacted by observed outcomes, and by their intrinsic tendencies to transition, captured

by the intercepts in the equations below:

$$\lambda_{itQQ} = \frac{\exp(\rho_Q + \tau_Q x_{it})}{1 + \exp(\rho_Q + \tau_Q x_{it})}; \qquad \lambda_{itPP} = \frac{\exp(\rho_P + \tau_P x_{it})}{1 + \exp(\rho_P + \tau_P x_{it})} \tag{6}$$

where 𝜌 represents a player’s tendency to remain in a certain state from one period to another, and 𝜏

captures the impact of game outcomes on the propensity to keep using the same rule. $x_{it}$ are the time-varying

covariates likely to impact switching between rules. We explore different game outcomes that could impact

the transition matrix, which we mention in the empirical estimation section. Note that most outcomes are

revealed in the same period t, before the feedback elicitation question, thus can impact the choice of

feedback rule in that period.
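The logistic form of the transition probabilities can be illustrated with a short sketch. It assumes a single scalar covariate (for example, the seller's feedback score), and the parameter values are hypothetical, chosen only to produce sticky states; they are not the estimates reported in Appendix C.

```python
import math

def inv_logit(z):
    """Inverse-logit: exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def transition_matrix(rho_q, tau_q, rho_p, tau_p, x):
    """Build the 2x2 transition matrix for one player and period.

    Rows index the rule used at t-1 (Q, then P); columns index the rule at t.
    Only the two "stay" probabilities are parameterized; the off-diagonal
    entries follow because each row must sum to one.
    """
    stay_q = inv_logit(rho_q + tau_q * x)
    stay_p = inv_logit(rho_p + tau_p * x)
    return [[stay_q, 1.0 - stay_q],
            [1.0 - stay_p, stay_p]]

# Hypothetical intercepts and slopes, evaluated at a feedback score of 2.
example = transition_matrix(rho_q=5.0, tau_q=0.5, rho_p=3.0, tau_p=0.2, x=2.0)
```

Large positive intercepts translate into diagonal entries close to one, which is the pattern of sticky states the model is designed to capture.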

Emission probabilities: Conditional on using a certain rule s at period t, the probability of rating j is given

by a multinomial logit specification. Subjects can leave feedback on a three-point scale j={1,2,3} in two

treatments, and on a five-point scale j={1 to 5} in the remaining two treatments. A subject i leaves a rating

j at time t with probability $\Pr_{it}(Y_{it} = j \mid S_{it} = s)$, conditional on the feedback rule s used:

$$\Pr_{it}(Y_{it} = j \mid S_{it} = s) = \frac{\exp(\beta_{js} K_{it}^{s})}{\sum_{k=1}^{J} \exp(\beta_{ks} K_{it}^{s})} \tag{7}$$

For identification purposes, we set all $\beta_1 = 0$, and interpret log-odds against the outcome feedback j = 1.

The time-varying set of explanatory variables $K_{it}$ differs given the state s. The experts in our HMME model

are the two feedback giving rules. If a player gives feedback according to the quality rule, then $K_{it}^{Q} = [1, Q_{it}/20, \mathit{Feedbackscore}_{it}]$. If a player gives feedback according to the profit rule, their feedback is

explained by their profit at period t, such that $K_{it}^{P} = [1, \Pi_{it}/10, \mathit{Feedbackscore}_{it}]$. We divide quality by

20 to match the feedback and feedback score range, and profit by 10 to ensure better sampling of the

parameters reflecting the impact of profit on the feedback given.
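A minimal sketch of the state-specific multinomial logit may help fix ideas. The coefficient values below are hypothetical and serve only to show the baseline normalization (all coefficients for rating 1 set to zero) and the covariate vector for the quality rule; the function name is ours.

```python
import math

def emission_probs(betas, covariates):
    """Multinomial logit rating probabilities for one state.

    betas: one coefficient list per rating level j = 1..J, where the first
    list is all zeros (rating 1 is the baseline).
    covariates: the state-specific covariate vector, e.g. [1, Q/20, score].
    """
    utilities = [sum(b * k for b, k in zip(beta_j, covariates)) for beta_j in betas]
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Quality-rule covariates: intercept, quality of 60% divided by 20,
# and a feedback score of 2.5 (illustrative values).
k_quality = [1.0, 60 / 20, 2.5]

# Hypothetical coefficients for a three-point scale; first row is the baseline.
betas_3pt = [[0.0, 0.0, 0.0],
             [-4.0, 2.0, -0.2],
             [-10.0, 4.0, -0.3]]

probs = emission_probs(betas_3pt, k_quality)
```

Under the profit rule the same function would be called with the profit-based covariate vector and a separate set of coefficients, which is what makes the two rules distinct "experts".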


Likelihood: The likelihood of observing a sequence of feedback rules and conditional ratings over

the T periods of the experiment, written in matrix form, is:

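$$L_{iT}\big(S_i(1), \ldots, S_i(T)\big) = \boldsymbol{\pi}_s' \, \mathbf{Pr}_{is1} \prod_{t=2}^{T} \boldsymbol{\Lambda}_{it} \mathbf{Pr}_{ist} \mathbf{1} \tag{8}$$

where $\boldsymbol{\pi}_s'$ is given by equation 4, $\boldsymbol{\Lambda}_{it}$ is the transition matrix in equation 5, $\mathbf{Pr}_{ist}$ is a diagonal matrix with

state specific probabilities given by equation 7, and $\mathbf{1}$ is a vector of ones.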
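The matrix product in the likelihood can be evaluated with a forward recursion. The sketch below is one plausible way to do so for a single player, with rescaling at each step to avoid numerical underflow; the function name and the input layout are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math

def hmm_log_likelihood(start, transitions, emissions):
    """Log-likelihood of one player's rating sequence via the forward recursion.

    start: initial state probabilities, e.g. [pi_Q, pi_P].
    transitions: list of T-1 matrices; transitions[t-1][r][s] is the probability
        of moving from state r to state s between observation t-1 and t (0-based).
    emissions: list of T lists; emissions[t][s] is the probability of the
        observed rating at observation t given state s.
    """
    n = len(start)
    # alpha corresponds to pi' * Pr_1 in the matrix form.
    alpha = [start[s] * emissions[0][s] for s in range(n)]
    log_lik = 0.0
    for t in range(1, len(emissions)):
        # Rescale and accumulate the log of the scaling constant.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
        lam = transitions[t - 1]
        # Multiply by the transition matrix, then by the diagonal emission matrix.
        alpha = [sum(alpha[r] * lam[r][s] for r in range(n)) * emissions[t][s]
                 for s in range(n)]
    # The final multiplication by the vector of ones is a sum over states.
    return log_lik + math.log(sum(alpha))
```

Because the transition matrix is supplied separately for every period, the same routine covers both the homogeneous case (repeat one matrix) and the non-homogeneous case (one matrix per period).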

Model selection: We estimate our model using Hamiltonian Monte Carlo simulation (Gelman et al. 2014)

in a hierarchical Bayesian framework, and implement forward filtering techniques following the likelihood

function in equation 8. We use measures of model fit and predictive accuracy to establish whether the

proposed NH-HMME is the best suited model despite its complexity9. In Table 2, we report log-predictive

density (LPD) and the Watanabe-Akaike Information Criterion (WAIC), which both account for model fit

and model complexity10. We also compute the mean-square error (MSE) and hit rates.11

Log-predictive density (LPD) and WAIC

Model           LPD R3U   LPD R3Q   LPD R5U   LPD R5Q   WAIC R3U   WAIC R3Q   WAIC R5U   WAIC R5Q
Quality model    -1,270      -966    -2,353    -1,985      2,537      1,924      4,699      3,928
Profit model     -1,508    -1,344    -2,613    -2,604      3,019      2,694      5,230      5,290
HMME             -1,070      -926    -2,380    -2,131      2,053      1,769      5,570      5,509
NH-HMME          -1,068      -922    -1,963    -1,687      2,048      1,749      3,644      2,979

MSE and hit rates (%)

Model           MSE R3U   MSE R3Q   MSE R5U   MSE R5Q   Hit R3U   Hit R3Q   Hit R5U   Hit R5Q
Quality model    0.0988    0.0505    0.1478    0.1478     68.59     77.49     61.55     61.48
Profit model     0.1537    0.1102    0.3245    0.2991     60.78     66.79     43.05     45.28
HMME             0.0635    0.0409    0.0909    0.0407     74.80     79.77     70.00     79.86
NH-HMME          0.0645    0.0410    0.1763    0.1120     74.59     79.76     58.06     66.54

Table 2: Model fit and predictive accuracy. Note that the HMME model does not converge well for the 5-star

treatments, further supporting the need for a non-homogeneous model.

9 In Appendix B we provide the results of a simulation study, showing model parameter recovery and hidden states estimation stability.

10 The log predictive density (Gelman 2014) is computed as the posterior mean of the log-likelihood function evaluated in each draw of the HMC sampler. Note that LPD is based on the forward-filtering probabilities, while WAIC is based on Viterbi decoding.

11 For each individual at each click, the squared error is computed as the square of the difference between the rating probability, and the actual behavior (j = {1,2,3} or j={1 to 5}). The mean is computed across visitors and clicks. Based on the above predicted probabilities, we draw a rating j and compute the percentage of correctly predicted ratings at every iteration of the HMC. We report the average hit rates across iterations. We expect hit rates to be lower in the “5 star” treatments than in the “3 star” treatments, since it is harder to predict a choice among 5 options rather than 3 options.


Results show that the proposed NH-HMME model outperforms a series of nested models, such as a model

where feedback is explained by the quality rule only, another model where feedback is explained by the profit rule

only12, and a HMME where transition matrices are not time-varying. The likelihood-based measures

(log-predictive densities and WAIC) strongly support the NH-HMME model, while the evidence is mixed when

looking at non-likelihood based measures (hit rates and mean square errors). Although the non-likelihood

based measures seem to support the HMME model, the model does not converge well in the “5-point”

treatments data (the Gelman-Rubin statistic $\hat{R} > 1.1$). We therefore conclude that the non-homogeneous

model is most appropriate in this setup.

Initial state probabilities: We report the results of the NH-HMME model. Substantive results

highlighting players’ feedback rule dynamics are reported here; all supporting results, including all parameter

estimates and their 95% highest density intervals, are in Appendix C. The experimental treatments can impact

buyers’ likelihood to decide to initially give feedback based on the profit vs. the quality rule. Before they

start the auction, players are instructed to leave feedback on either a 3-point or a 5-point scale, and that they

should rate the quality (directed feedback) or the transaction (general feedback).

The initial state probabilities reported in Table 3 show that, in all treatments, subjects are more likely to

start by giving feedback based on quality received. When asked to leave a rating (general feedback),

subjects are more likely to initially use the profit rule (.284 and .362 for the 3-point scale and the 5-point

scale treatment respectively), than when asked to rate the quality received (.19 and .194 for the 3-point scale

and the 5-point scale treatment respectively). Interestingly, subjects are more likely to initiate the game

using the profit rule in the “5-point, rate transaction” treatment than in the “3-point, rate transaction”

treatment, even though in the “5-point, rate transaction” treatment the quality scale matches the feedback

scale, and a fluency argument would suggest that it is easier for subjects to

rate quality received. As expected, the overall probability to initially use the quality rule is highest in the

directed feedback treatments, where subjects are asked to rate the quality received.

State transitions: To illustrate the average feedback rule dynamics, and the impact of game outcomes on

the probability to transition from feedback rule s to feedback rule s’, we report in Table 3 the transition

matrices at high, average and low levels of the feedback score.

To test whether game outcomes impact state transitions, we considered the impact of several variables on

the transition probabilities. We considered the effect of whether the buyer made a loss in the current

12 We also estimated a hidden Markov model (HMM) where quality and profit are both integrated in the emission equations, and their coefficients are allowed to vary by state. As explained, due to the high correlation between quality and profit, the model is not well-identified, and the Gelman-Rubin statistic shows that the model does not converge ($\hat{R} > 1.1$). The results are available upon request.

period13, buyers’ expected payoffs, expected quality, the differences between actual and expected payoffs,

and the differences between expected quality and quality received. Game outcomes have a limited but

significant impact on transitions between rules. The best fitting model is one where the seller’s feedback

score, available to buyers before the transaction occurs, explains a state transition, but the effects are rather

small (see Appendix C). Note that we also control for feedback score in the emission probability model.

This suggests that feedback score has long-term impact on buyers’ behavior, impacting their transitions

among feedback rules, as well as a short-term effect, impacting the ratings given, conditional on the chosen

feedback rule at any period.

Looking at Table 3, the high values of the diagonal elements in the average transition matrices, most above

90%, suggest that states are very sticky, and players are likely to repeat the feedback rule used in previous

periods of the game. Players rarely transition from the quality to the profit rule, but are slightly more likely

to revert to the quality rule if they previously used the profit rule as a basis for their ratings. Players are

most likely to switch from the profit to the quality rule in the directed feedback treatments, particularly in

the “5-point, rate quality” condition (0.081 vs. approx. 0.03 in all other treatments).

Across all treatments, there seems to be a higher variability in feedback rule dynamics when feedback

scores are low, compared to average or high feedback scores.

Transition matrices at average (baseline), low and high feedback score (FS), and average start probabilities

Condition                   (t-1) to t   Avg FS: Q   Avg FS: P   Low FS: Q   Low FS: P   High FS: Q   High FS: P   Avg start prob.
3-point, rate transaction   Q            0.9996      0.0004      0.9964      0.0036      1            0            0.716
3-point, rate transaction   P            0.0333      0.9667      0.0494      0.9506      0.0224       0.9776       0.284
3-point, rate quality       Q            0.997       0.003       0.9818      0.0182      0.9953       0.0047       0.81
3-point, rate quality       P            0.0258      0.9742      0.1073      0.8927      0.0372       0.9628       0.19
5-point, rate transaction   Q            0.9947      0.0053      0.9843      0.0157      0.9982       0.0018       0.638
5-point, rate transaction   P            0.0245      0.9755      0.0584      0.9416      0.0101       0.9899       0.362
5-point, rate quality       Q            0.9983      0.0017      0.9912      0.0088      0.9997       0.0003       0.806
5-point, rate quality       P            0.0809      0.9191      0.095       0.905       0.0687       0.9313       0.194

Table 3: Average start probabilities, transition matrices, and changes in transition probabilities as a function of sellers’ feedback score, across treatments. States seem particularly sticky. Note: Feedback scores used to compute the transition matrices are: FS={low=1.2, average=2, high=2.8} for the 3-point scale treatments, and FS={low=2, average=3, high=4} for the 5-point scale treatments.

13 Since players are informed about a seller’s feedback score, quality received and their profits before they are asked to leave a rating, current profits might influence players to switch between feedback rules.

Figure 11: Population level smoothed state probabilities across the 60 periods of the game. The plot shows population

level feedback rule dynamics. Note that transition matrices are time-varying, depending on feedback score.

In Figure 11, we plot the population level smoothed state probabilities across the 60 periods of the game.

The plot shows population level feedback rule dynamics. Transition matrices are allowed to vary over time,

as a function of sellers’ feedback score. The plot highlights the need to use a time-varying, i.e. non-

homogeneous model to account for the impact of game outcomes. As seen in Figure 11, there are substantial

differences in feedback rule dynamics across treatments. The “5-point, rate quality” and the “3-point, rate

transaction” treatments exhibit stronger dynamics in feedback rules than the “5-point, rate transaction”

and the “3-point, rate quality” treatments.
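Smoothed state probabilities of the kind plotted in Figure 11 are typically obtained with a forward-backward pass. The following is a minimal sketch for one player under assumed inputs; averaging the per-player results across subjects to obtain a population-level curve is our assumption about how such a plot could be produced, not a description of the authors' code.

```python
def smoothed_state_probs(start, transitions, emissions):
    """Pr(state at t | all observed ratings) for each observation t.

    start: initial state probabilities.
    transitions: list of T-1 matrices; transitions[t][r][s] is the probability
        of moving from state r at observation t to state s at t+1 (0-based).
    emissions: list of T lists; emissions[t][s] is the probability of the
        observed rating at observation t given state s.
    """
    n, T = len(start), len(emissions)
    # Forward pass: filtered probabilities, normalized at every step.
    fwd = []
    alpha = [start[s] * emissions[0][s] for s in range(n)]
    for t in range(T):
        if t > 0:
            lam = transitions[t - 1]
            alpha = [sum(fwd[-1][r] * lam[r][s] for r in range(n)) * emissions[t][s]
                     for s in range(n)]
        total = sum(alpha)
        fwd.append([a / total for a in alpha])
    # Backward pass: combine with information from later observations.
    beta = [1.0] * n
    smoothed = [None] * T
    for t in range(T - 1, -1, -1):
        if t < T - 1:
            lam = transitions[t]
            beta = [sum(lam[s][r] * emissions[t + 1][r] * beta[r] for r in range(n))
                    for s in range(n)]
            norm = sum(beta)
            beta = [b / norm for b in beta]
        joint = [fwd[t][s] * beta[s] for s in range(n)]
        total = sum(joint)
        smoothed[t] = [j / total for j in joint]
    return smoothed
```

Unlike filtered probabilities, the smoothed ones use ratings observed after period t as well, so an early period can be reassigned to the quality rule once later ratings make that rule more plausible.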

In the “5-point, rate quality” treatment in particular, subjects seem to learn to use the quality rule throughout

the game, as the profit rule is much less sticky at 92%, than the quality rule, which hovers around 99%.

Therefore, once users switch to the quality rule, they are far less likely to switch back to the profit rule.
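A back-of-the-envelope calculation makes the asymmetry concrete. The sketch below takes the baseline diagonal entries for the “5-point, rate quality” treatment from Table 3 and treats the chain as if it were homogeneous at the average feedback score. That is a simplifying assumption for illustration only, since the estimated transition matrices vary with the feedback score.

```python
# Baseline "stay" probabilities for the "5-point, rate quality" treatment (Table 3).
stay_q, stay_p = 0.9983, 0.9191

# Under a homogeneous chain, spell lengths are geometric, so the expected
# number of consecutive periods spent using a rule is 1 / (1 - stay).
expected_spell_q = 1.0 / (1.0 - stay_q)
expected_spell_p = 1.0 / (1.0 - stay_p)

# Long-run share of periods spent on the quality rule for a two-state chain:
# the rate of entering Q divided by the sum of both switching rates.
leave_q, leave_p = 1.0 - stay_q, 1.0 - stay_p
long_run_share_q = leave_p / (leave_p + leave_q)
```

With these inputs, a spell on the profit rule lasts roughly 12 periods on average, a spell on the quality rule lasts several hundred, and the implied long-run share of the quality rule is about 98%, consistent with subjects gravitating to the quality rule over the 60 periods.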

State-specific behavior. We report the parameter estimates of the emission probability models in Appendix

C. The parameter estimates are in line with our expectations, and show that quality and profits have a

positive impact on feedback scores. The higher the quality received and the higher the profit, the more

likely we are to observe a higher rating vs. a rating of one (our baseline). Conditional on feedback rule, we


computed probabilities of giving a specific feedback level, using a multinomial logit model, and plot the

results in figures 12 and 13.

Figure 12: Predicted rating probabilities at different levels of quality and feedback score, conditional on using the

quality rule, across all treatments.

Feedback analysis conditional on the quality rule. Subjects in both “5-star” treatments seem to follow very

similar strategies (blue and red lines very close at most levels of the feedback score and quality, in Figure

12). This is also supported by the fact that most of the 95% HDIs of the parameter estimates overlap (see

Appendix C). When comparing the use of the quality feedback rule across the “3-star” treatments,

surprisingly, in the “3-star, rate quality” treatment, ratings are more impacted by feedback scores than in

the “3-star, rate transaction” treatment.

Feedback score has a much smaller impact on ratings than quality received, and makes more of a difference

at low and intermediate levels of quality received. Interestingly, the feedback score coefficient is negative,

and it seems to accommodate a disappointment or punishment effect: when quality received is low, the

higher the feedback score the lower the feedback given; when quality received is high, the impact of a

seller’s feedback score on rating given is much smaller.

In the “5-star” treatments, when the feedback score was lower than the quality received (quality exceeds

expectations), the average likelihood of giving a feedback of one level below quality received is 0.065 for


the directed feedback condition, and 0.067 for general feedback condition. When the feedback score was

higher than the quality received (subjects are disappointed with the quality received), the average likelihood

of giving a feedback of one level below quality received is 0.242, and 0.245 for directed vs. general

feedback elicitations, respectively. When the feedback score matches quality received, the average

likelihood of giving a feedback of one level below quality received is 0.109, and 0.111, for directed vs.

general feedback respectively. The disappointment of not receiving the expected quality (as suggested by

the seller’s feedback score) seems to lead subjects to punish sellers by leaving lower ratings. If subjects

were to Bayesian update, they should consider the feedback score as a prior and give sellers a higher

feedback rating than the quality received. We should therefore observe the opposite trend.

The same analysis for the 3-star treatments requires an extra step to impose an optimal mapping between

feedback rating and the quality received, and infer which rating is lower than warranted by the quality

received. Nonetheless, Figure 12 appears to support the same trend for the “3-star, rate quality” treatment,

but not for the “3-star, rate transaction” treatment. Looking at the “3-star, rate quality” treatment, when

subjects receive 40% or 60% quality (2 or 3), they seem to consider feedback score to guide their rating.

When quality received is 40% (2), and feedback score is 3, subjects are much more likely to give a feedback

of 1 (.75), suggesting a punishment effect due to subjects’ disappointment from receiving a lower than

expected quality. When the reverse is true, and quality received is 2 whereas feedback score is 1, subjects

become very lenient, and are 75% likely to leave a rating of 2 instead of a rating of 1. When feedback score

matches quality received at 2, the likelihood of rating is evenly split between 1 and 2. Feedback score does

not make a difference when quality received is 80%. Regardless of feedback score, subjects are 75% to

80% likely to leave a rating of 3, irrespective of the treatment.


Figure 13: Predicted rating probabilities at different levels of profit and feedback score, conditional on using the profit

rule, across all treatments.

Feedback analysis conditional on the profit rule. Sellers’ feedback score seems to be more impactful when

players decide to use the profit rule, and has a positive impact on feedback. Most parameter estimates are

either significantly different from 0 (the 95% HDIs do not contain 0), or directionally higher than 0 (see

Appendix C). The higher the feedback scores, the higher the likelihood of leaving a higher rating, for the

same level of profit, particularly in the “Rate transaction” treatments.

Overall, results show that feedback rule dynamics are strongly impacted by initial state probabilities, and

to some degree by game outcomes altering state transition probabilities. Start probabilities are likely

impacted by the feedback system design. In the initial period, the design of the market is buyers’ only available

information. To be successful, feedback systems must be designed to induce traders to give accurate and

useful feedback. When the feedback scale and the feedback elicitation question are mapped on the same

dimension, as in our “5-point, rate quality” treatment, buyers are able to focus their feedback ratings on the

dimension that is most diagnostic of seller performance.


5 Discussion and future directions

In this paper, we aim to uncover principles towards the design of an informative feedback system. A

feedback system is characterized by its information structure, including the message space, the feedback

aggregation function and the mapping function used by buyers to rate sellers’ performance and hence

depends on decisions made by the platform and by market participants. We implement an experimental

design where we vary two components of the feedback system: the feedback elicitation question and the

rating scale. The feedback elicitation question asks for directed (rate quality) vs. general feedback (rate

transaction). The rating scale accommodates designs typically used in practice, with a 5-point scale that in

practice implies a rating of product quality, and a 3-point scale that usually implies a negative, neutral or

positive experience of the interaction with the seller. Our two-by-two full factorial design aims to mimic

four online market designs, in which the feedback elicitation question is mapped or not into the most

predictive dimension of seller performance.

We choose to implement a lab experiment over seeking other types of data to study feedback system design

because of the tight control of lab experiments over field experiments or observational data. One key feature

of our experimental market is the compulsory feedback, intended to limit any selection bias introduced by

traders’ choice to not leave feedback following a transaction. This is crucial, as literature has shown that

the decision to abstain from leaving feedback is not random but driven by transaction experience (Bolton

et al. 2018, Dellarocas and Wood 2008), or social concerns. We avoid any confounding effects on

information transmission by requiring all traders to leave feedback after each transaction. In our

experiment, traders cannot communicate with each other, reducing any concerns related to empathy or

social pressure. Additionally, subjects are fully informed about the distribution of seller types, reducing

buyers’ uncertainty with respect to expected quality, compared to an online platform. Although artificial in several

respects, our experiment retains the key features of market design relevant in practice. Buyers have private

valuations for the traded goods, can make profits or losses from a transaction, and feedback scores are

highly informative for buyers’ behavior.

Sound principles for the design of market feedback systems are principles that optimize the informativeness

of the system by focusing on the important forecast variables. Our results show that when the feedback

elicitation question and the feedback scale align on a common dimension deemed most relevant in a

particular marketplace, the feedback system is most informative and best able to forecast sellers’ performance.

This main conclusion is supported by an informativeness analysis, followed by a second analysis which

uncovers buyers’ feedback giving rules and the drivers of rating behavior. We use two approaches to

analyze system informativeness measured by relative entropy, which measures the improvement from using

feedback information to form beliefs instead of using prior beliefs about the seller types. The first is based

on the theoretically expected belief distribution given our model across the four market designs, using the

empirical mappings of quality into feedback score. The second approach estimates the belief distributions

directly from the experimental data available to the subjects using the multinomial regressions. Both

methods reveal that the most informative feedback system is the directed “5-point, rate quality” treatment,

which maps the feedback elicitation question and feedback rating scale into the quality scale.

Our empirical analysis further reveals the dynamics in the underlying feedback giving rules used by buyers

throughout the transaction periods. Buyers transition between leaving feedback based on the quality

received and based on their profits from a transaction, and the transition propensities are influenced by

sellers’ feedback score. Interestingly, even after accounting for its long-term impact on the chosen feedback

rule, feedback score seems to impact sellers’ ratings in the short run. This is suboptimal, as buyers should

only use the quality received as a proxy for seller performance when leaving a rating after the transaction

has occurred (assuming a transaction occurred). Feedback scores should only be used for deciding on the

bid. Our analysis suggests that feedback score seems to accommodate buyers’ disappointment at receiving

a low quality. Buyers seem to punish sellers who perform worse than the feedback score would suggest, by

leaving a lower rating than expected given the quality shipped by the seller.

Our most informative system, i.e. the system leading to largest relative entropy, is the “5-point, rate quality”

treatment. This supports our suggestion to design a feedback system with a clear mapping between the

feedback rating scale and the feedback elicitation question, linked to the dimension deemed most

informative for the feedback system, which in our case is quality. A fluent design allows people to easily,

almost automatically, leave feedback on the desired dimension. The system with the most fluent design leads

to more accurate feedback. These results are consistent with psychological studies showing that fluency

eases information processing (Alter and Oppenheimer 2009) and significantly improves users’

experience with an online system.

Online platforms can readily implement the conclusions of our analysis by identifying the dimension most

diagnostic of seller performance, and designing a fluent feedback system that predicts sellers’ performance

on that dimension. As future buyers are aware of the feedback system, further studies could test whether

the system alignment also leads them to a priori interpret sellers’ feedback scores on the desired dimension,

even before the first transaction. Our experimental design offers a high degree of flexibility and would

allow other researchers to test various feedback systems under specific setups.

Our analysis involves computerized sellers. As a next step, we would introduce human sellers who can

strategically choose the quality shipped in order to maximize profits. Such an extension of our experimental

design to a moral hazard setting would allow us to analyze whether the feedback system design features

found to be optimal in the current adverse selection setup would also be consequential in a moral hazard

setting. Otherwise, there may be a need to provide incentives for trustworthiness due to sellers’ strategic

behavior, and adapt the feedback system to account for such behavior.

References

Alter, A.L., Oppenheimer, D.M. (2009). Uniting the Tribes of Fluency to Form a Metacognitive Nation.

Personality and Social Psychology Review 13(3), 219-235.

Ansari, A., Montoya, R., Netzer, O. (2012). Dynamic Learning in Behavioral Games: A Hidden Markov

Mixture of Experts Approach. Quantitative Marketing and Economics 10 (4), 475-503.

Bajari P., Hortacsu A. (2004). Economic insights from internet auctions. Journal of Economic Literature,

42(2), 457–486.

Bauerly, R.J., (2009). Online auction fraud and eBay. Marketing Management Journal 19, 133-143.

Becker, G.M, DeGroot, M.H., Marschak, J. (1964). Measuring Utility by a Single-Response Sequential

Method. Behavioral Science, 9(3), 226-232.

Bolton, G.E., Greiner, B., Ockenfels, A., (2013). Engineering Trust - Reciprocity in the Production of

Reputation Information. Management Science 59, 265–285.

Bolton, G.E., Kusterer, D.J., Mans, J., (forthcoming). Inflated reputations, Uncertainty, Leniency and

Moral Wiggle Room in Trader Feedback Systems. Management Science.

Cokely, E. T., Galesic, M., Schulz, E., Ghazal, S., Garcia-Retamero, R. (2012). Measuring risk literacy:

The Berlin Numeracy Test. Judgment and Decision Making, 7(1), 25–47.

Cover, T. M., Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience, Second edition.

Crosetto, P., Filippin, A. (2013). The “bomb” risk elicitation task. Journal of Risk and Uncertainty, 47(1),

31–65.

Dellarocas, C., Wood, C.A., (2008). The Sound of Silence in Online Feedback: Estimating Trading Risks

in the Presence of Reporting Bias. Management Science 54, 460–476.

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D. (2014). Bayesian Data Analysis, CRC

Press, 3rd edition.

Golman, R., Bhatia, S. (2012). Performance evaluation inflation and compression. Accounting,

Organizations and Society, 37(8), 534–543.

Gregg, D. G., Scott, J. E. (2006). The role of reputation systems in reducing on-line auction fraud.

International Journal of Electronic Commerce, 10(3), 95–120.

Gutt, D., Neumann, J., Zimmermann, S., Kundisch, D., Chen, J. (2018). Design of review systems – a

strategic instrument to shape online review behavior and economic outcomes. Working paper.

Kauffman, R. J., Wood, C. A. (2006). Doing their bidding: An empirical examination of factors that affect

a buyer’s utility in internet auctions. Information Technology and Management, 7(3), 171–190.

Nosko, C., Tadelis, S. (2015). The limits of reputation in platform markets: An empirical analysis and field

experiment. NBER Working Paper No. 20830.

Payne, J. W., Bettman, J. R., Johnson, E. J. (1988). Adaptive strategy selection in decision

making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3), 534-552.

Resnick, P., Zeckhauser, R., Swanson, J., Lockwood, K. (2006). The value of reputation on eBay: A

controlled experiment. Experimental Economics, 9(2), 79–101.

Roth, A.E., (2002). The Economist as Engineer: Game Theory, Experimentation, and Computation as Tools

for Design Economics. Econometrica 70, 1341–1378.

Tadelis, S., (2016). Reputation and Feedback Systems in Online Platform Markets. Annual Review of

Economics 8, 321–340.

Salmon, T. (2004). Evidence for learning to learn behavior in normal form games. Theory and Decision,

56(4), 367–404.

SoPHIE labs: https://www.sophielabs.com

Stahl, D. (2001). Population rule learning in symmetric normal-form games: theory and evidence. Journal

of Economic Behavior and Organization, 1304, 1–17.

Veldkamp, L. L. (2011). Information choice in macroeconomics and finance. Princeton University Press.

Appendices

Appendix A. Details of the experiment

1.1. Screenshots of the simulation used to demonstrate that it is optimal for bidders to bid their valuation.

Appendix B: Simulation study

We report a simulation study to assess the empirical identification of the non-homogeneous HMME model

parameters. We generate data very similar to our experimental data, to keep the simulated data as close as

possible to the empirical application. To that end, we use the independent variables from our empirical

dataset, from the “3-star, rate transaction” treatment. We simulate 54 individuals, who participate in

auctions for 30 periods. The levels of quality received, buyer valuations for the traded goods, buyers’ profits

and sellers’ feedback scores are the same as in the empirical application. We simulate behavior using the

proposed NH-HMME model, and the model parameters are close to the values estimated from the empirical

application (see Tables B1.1 and B1.2 for parameter recovery).
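The data-generating step for the hidden feedback rules can be sketched as follows. This is only an illustration under stated assumptions: the logistic form of the transition probability, the way ρ and τ enter it, and the randomly drawn feedback scores are hypothetical stand-ins (the study itself uses the covariates from the empirical dataset), and the function names are not taken from the paper. The parameter values are the true values listed in Table B1.2.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_rule_path(scores, rho, tau, pi_quality, rng):
    """Simulate one buyer's hidden rule per period (0 = quality rule, 1 = profit rule).

    Assumed transition form: P(stay in state s) = logistic(rho[s] + tau[s] * score),
    so the seller's feedback score shifts the propensity to switch rules.
    """
    state = 0 if rng.random() < pi_quality else 1  # draw the initial rule
    path = [state]
    for score in scores[1:]:
        p_stay = logistic(rho[state] + tau[state] * score)
        if rng.random() >= p_stay:
            state = 1 - state  # switch to the other rule
        path.append(state)
    return path

rng = random.Random(0)
n_buyers, n_periods = 54, 30            # sizes reported for the simulation study
rho, tau = (2.0, 1.0), (0.5, 0.5)       # true values from Table B1.2
pi_quality = 0.7                        # initial probability of the first state

# Placeholder feedback scores in [0, 1]; a hypothetical substitute for the empirical covariates
scores = [[rng.uniform(0.0, 1.0) for _ in range(n_periods)] for _ in range(n_buyers)]
paths = [simulate_rule_path(s, rho, tau, pi_quality, rng) for s in scores]
```

Ratings would then be drawn from the emission model of whichever rule is active in each period.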

Parameter             True     Est. mean  HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
Quality rule
β_{1,intercept}       -5       -2.905     -4.078     -1.862     685.149           0.999
β_{1,quality}         2        1.802      1.439      2.215      750.817           0.999
β_{1,feedbackscore}   -0.1     -0.915     -1.578     -0.358     584.268           1.002
β_{2,intercept}       -15      -9.869     -11.674    -8.228     776.121           0.999
β_{2,quality}         5        4.259      3.622      4.948      607.354           0.999
β_{2,feedbackscore}   -0.2     -1.539     -2.351     -0.756     651.036           1.005
Profit rule
β_{1,intercept}       0        -1.460     -2.535     -0.411     723.977           1.004
β_{1,profit}          0.1      0.088      0.056      0.122      1000.000          0.999
β_{1,feedbackscore}   -0.05    0.687      0.160      1.233      698.243           1.004
β_{2,intercept}       -3       -4.356     -6.364     -2.410     732.033           1.009
β_{2,profit}          0.15     0.138      0.076      0.198      1000.000          0.999
β_{2,feedbackscore}   0.1      0.714      -0.336     1.696      694.871           1.005

Table B1.1: Parameter recovery, with data generated from the NH-HMME model, for the emission model

Parameter             True     Est. mean  HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
ρ_1                   2        2.098      0.155      4.176      804.819           1.000
ρ_2                   1        1.388      -1.173     4.147      791.055           1.000
τ_1                   0.5      0.423      -0.582     1.435      773.110           0.999
τ_2                   0.5      0.378      -1.050     1.804      772.107           0.999
π_1                   0.7      0.631      0.433      0.823      1000.000          1.001
π_2                   0.3      0.369      0.177      0.567      1000.000          1.001
Log-posterior         n/a      -1187.82   -1195.06   -1183.04   301.454           0.999

Table B1.2: Parameter recovery, with data generated from the NH-HMME model, for the transition model

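The transition-model parameters are well recovered: the 95% highest density intervals in Table B1.2 all

contain the true values. For the emission model (Table B1.1), the intervals contain the true values for five of

the twelve parameters, including both profit coefficients; the other seven, mostly intercepts and feedback

score coefficients, have true values outside their intervals.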
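The interval check behind this kind of recovery statement is simple to reproduce. The sketch below applies it to the transition-model rows of Table B1.2; the numbers are copied from that table, while the dictionary keys are positional labels introduced here for illustration.

```python
# (true value, HDI 2.5%, HDI 97.5%) copied from Table B1.2.
# Keys are positional labels (row order in the table), not the paper's notation.
transition_rows = {
    "rho_1": (2.0, 0.155, 4.176),
    "rho_2": (1.0, -1.173, 4.147),
    "tau_1": (0.5, -0.582, 1.435),
    "tau_2": (0.5, -1.050, 1.804),
    "pi_1": (0.7, 0.433, 0.823),
    "pi_2": (0.3, 0.177, 0.567),
}

def covered(true_value, lower, upper):
    """True if the interval [lower, upper] contains the true value."""
    return lower <= true_value <= upper

coverage = {name: covered(*row) for name, row in transition_rows.items()}
share_covered = sum(coverage.values()) / len(coverage)
```

The same function can be applied row by row to Table B1.1 to see which emission parameters are recovered.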

We also explored the identification of the hidden feedback rules. Figure B1 plots the true probabilities of

using the quality rule against the filtered probabilities and the proportion of quality-rule states estimated

through Viterbi decoding. The three quantities are very similar. The overall hidden state recovery rate is

85.62%.
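For readers unfamiliar with Viterbi decoding, the sketch below decodes the most likely state path of a small two-state hidden Markov model and computes a recovery rate against a known path. It is a simplified illustration: it assumes a homogeneous transition matrix and discrete emissions, whereas the NH-HMME model has covariate-dependent transitions, and all probabilities and paths here are hypothetical.

```python
import math

def viterbi(obs, log_init, log_trans, log_emit):
    """Most likely hidden state path for a discrete-emission HMM (log domain)."""
    n_states = len(log_init)
    # delta[s]: best log-probability of any path that ends in state s
    delta = [log_init[s] + log_emit[s][obs[0]] for s in range(n_states)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = delta
        pointers, delta = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            delta.append(prev[best] + log_trans[best][s] + log_emit[s][o])
        back.append(pointers)
    # Backtrack from the best final state
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def recovery_rate(true_path, decoded_path):
    """Share of periods in which the decoded state equals the true state."""
    return sum(t == d for t, d in zip(true_path, decoded_path)) / len(true_path)

def logs(matrix):
    return [[math.log(p) for p in row] for row in matrix]

# Hypothetical two-state example (0 = quality rule, 1 = profit rule)
log_init = [math.log(0.6), math.log(0.4)]
log_trans = logs([[0.9, 0.1], [0.1, 0.9]])
log_emit = logs([[0.9, 0.1], [0.1, 0.9]])

decoded = viterbi([0, 0, 1, 1, 1], log_init, log_trans, log_emit)
rate = recovery_rate([0, 0, 1, 1, 0], decoded)
```

In a simulation study the true path is known by construction, so the recovery rate can be computed directly in this way across all simulated buyers and periods.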

Figure B1: True vs. filtered probabilities of using the quality rule, and quality state proportions recovered using

Viterbi decoding.

Appendix C: Results of the NH-HMME model

C1: Model parameter estimates and 95% highest density intervals for the “3-star, rate transaction”

treatment.


Parameter             Mean       HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
Quality rule
β_{1,intercept}       -6.849     -8.090     -5.845     452.484           1.000
β_{1,quality}         3.197      2.846      3.594      372.926           1.004
β_{1,feedbackscore}   -0.078     -0.546     0.416      505.101           0.998
β_{2,intercept}       -19.274    -21.187    -17.675    491.895           1.005
β_{2,quality}         6.663      6.100      7.302      346.514           1.007
β_{2,feedbackscore}   -0.154     -0.909     0.529      478.314           0.999
Profit rule
β_{1,intercept}       -0.013     -1.135     1.149      378.613           0.999
β_{1,profit}          0.076      0.033      0.124      524.080           1.001
β_{2,intercept}       -3.458     -5.085     -1.812     461.768           1.004
β_{2,profit}          0.133      0.069      0.196      494.422           0.999
β_{2,feedbackscore}   0.906      0.102      1.672      387.666           1.004

Table C1.1: Emission model estimates for the “3-star, rate transaction” treatment



Parameter             Mean       HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
ρ_1                   2.228      -0.951     5.329      600.000           1.007
ρ_2                   2.343      -0.887     5.627      460.742           1.002
τ_1                   2.822      0.680      5.765      394.601           1.010
τ_2                   0.512      -1.190     2.312      441.438           1.002
π_1                   0.716      0.562      0.846      600.000           0.999
π_2                   0.284      0.154      0.438      600.000           0.999

Table C1.2: Transition model estimates for the “3-star, rate transaction” treatment

C2: Model parameter estimates and 95% highest density intervals for the “3-star, rate quality” treatment.



Parameter             Mean       HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
Quality rule
β_{1,intercept}       -5.175     -6.095     -4.052     536.761           1.001
β_{1,quality}         3.570      3.164      3.992      360.233           1.002
β_{1,feedbackscore}   -1.031     -1.482     -0.545     444.684           1.001
β_{2,intercept}       -19.497    -21.339    -17.544    600.000           0.997
β_{2,quality}         7.672      6.991      8.286      342.561           0.999
β_{2,feedbackscore}   -1.378     -2.090     -0.662     515.993           1.005
Profit rule
β_{1,intercept}       -0.479     -2.248     1.207      376.352           1.005
β_{1,profit}          0.444      0.295      0.630      536.345           1.000
β_{2,intercept}       -5.049     -7.286     -2.966     323.200           1.002
β_{2,profit}          0.721      0.544      0.956      495.816           1.001
β_{2,feedbackscore}   0.576      -0.603     1.624      365.510           1.004

Table C2.1: Emission model estimates for the “3-star, rate quality” treatment



Parameter             Mean       HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
ρ_1                   1.238      -1.742     4.210      437.827           1.000
ρ_2                   -0.148     -2.950     3.029      477.508           1.003
τ_1                   2.290      0.711      4.158      429.653           0.998
τ_2                   1.889      0.223      3.352      396.496           1.003
π_1                   0.810      0.672      0.919      600.000           0.999
π_2                   0.190      0.081      0.328      600.000           0.999

Table C2.2: Transition model estimates for the “3-star, rate quality” treatment

C3: Model parameter estimates and 95% highest density intervals for the “5-star, rate transaction”

treatment.



Parameter             Mean       HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
Quality rule
β_{1,intercept}       -1.652     -2.550     -0.838     463.348           0.999
β_{1,quality}         3.422      2.842      4.006      310.507           1.004
β_{2,intercept}       -6.856     -8.057     -5.667     530.018           1.002
β_{2,quality}         6.636      5.948      7.417      279.252           1.005
β_{3,intercept}       -16.348    -17.916    -14.775    558.056           1.000
β_{3,quality}         9.755      8.960      10.662     250.655           1.002
β_{3,feedbackscore}   -2.142     -2.731     -1.623     216.685           1.008
β_{4,intercept}       -29.979    -32.048    -27.835    600.000           0.999
β_{4,quality}         13.099     12.158     14.086     265.331           1.002
β_{4,feedbackscore}   -2.473     -3.184     -1.847     254.040           1.000
Profit rule
β_{1,intercept}       1.107      -0.109     2.290      366.422           0.997
β_{1,profit}          0.062      0.022      0.102      600.000           1.001
β_{2,intercept}       -0.941     -2.063     0.379      285.049           1.005
β_{2,profit}          0.143      0.102      0.184      494.772           1.002
β_{3,intercept}       -3.105     -4.379     -1.807     239.882           1.003
β_{3,profit}          0.228      0.177      0.276      459.346           1.012
β_{3,feedbackscore}   0.413      0.025      0.812      249.988           1.001
β_{4,intercept}       -4.693     -6.204     -3.108     214.325           1.010
β_{4,profit}          0.284      0.229      0.342      392.190           1.010
β_{4,feedbackscore}   0.589      0.154      1.030      223.695           1.008

Table C3.1: Emission model estimates for the “5-star, rate transaction” treatment



Parameter             Mean       HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
ρ_1                   1.959      -1.343     5.427      600.000           1.001
ρ_2                   0.970      -2.557     4.646      461.907           1.000
τ_1                   1.091      -0.089     2.372      600.000           1.003
τ_2                   0.905      -0.372     2.277      464.814           1.000
π_1                   0.638      0.479      0.787      600.000           1.009
π_2                   0.362      0.213      0.521      600.000           1.009

Table C3.2: Transition model estimates for the “5-star, rate transaction” treatment

C4: Model parameter estimates and 95% highest density intervals for the “5-star, rate quality” treatment.



Parameter             Mean       HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
Quality rule
β_{1,intercept}       -2.763     -3.522     -1.951     427.413           0.999
β_{1,quality}         3.546      3.078      4.004      346.070           1.000
β_{2,intercept}       -8.879     -9.994     -7.812     400.690           0.999
β_{2,quality}         7.215      6.603      7.926      271.993           1.005
β_{3,intercept}       -18.387    -19.846    -17.016    422.690           0.998
β_{3,quality}         10.065     9.377      10.771     232.075           1.004
β_{3,feedbackscore}   -1.760     -2.250     -1.247     258.804           1.006
β_{4,intercept}       -32.836    -34.717    -31.037    501.201           1.000
β_{4,quality}         13.733     12.976     14.521     224.273           1.009
β_{4,feedbackscore}   -2.244     -2.809     -1.678     273.720           1.006
Profit rule
β_{1,intercept}       -0.739     -2.412     1.135      508.889           0.999
β_{1,profit}          0.048      -0.020     0.114      600.000           0.999
β_{2,intercept}       -2.306     -4.022     -0.817     600.000           1.004
β_{2,profit}          0.072      0.004      0.132      554.966           0.998
β_{3,intercept}       -2.670     -4.394     -0.842     522.003           1.000
β_{3,profit}          0.105      0.037      0.179      600.000           0.997
β_{3,feedbackscore}   0.290      -0.354     0.891      540.582           0.999
β_{4,intercept}       -3.040     -4.897     -1.167     600.000           0.998
β_{4,profit}          0.150      0.071      0.242      600.000           0.997
β_{4,feedbackscore}   0.207      -0.392     0.773      600.000           0.998

Table C4.1: Emission model estimates for the “5-star, rate quality” treatment



Parameter             Mean       HDI 2.5%   HDI 97.5%  Eff. sample size  R̂
ρ_1                   1.445      -1.532     4.273      411.812           1.007
ρ_2                   1.900      -1.252     4.950      600.000           1.003
τ_1                   1.640      0.532      3.025      459.120           1.005
τ_2                   0.177      -0.840     1.419      441.828           1.003
π_1                   0.806      0.671      0.919      600.000           1.001
π_2                   0.194      0.081      0.329      600.000           1.001

Table C4.2: Transition model estimates for the “5-star, rate quality” treatment
