Rate this transaction: Towards principles for the design of
market feedback systems
Gary E. Bolton1, Alina Ferecatu2, David J. Kusterer3
— Preliminary draft —
— Please do not circulate —
Abstract: Market feedback systems are critical to enforcing trader trustworthiness in many online and
offline markets. In this paper, we aim to outline guidelines for how feedback systems should
be designed. We conduct an incentive-aligned experiment mirroring an online market to
characterize buyer behavior, in terms of ratings given and use of feedback scores. We investigate
the friction in information transmission due to the choice of the feedback scale (3-point vs. 5-point
scale) and feedback elicitation question (general or directed feedback). We characterize the impact
of these feedback system design choices on system informativeness with an entropy model
forecasting sellers’ performance. Using a hidden Markov mixture of experts model, we investigate
whether buyers switch between giving feedback based on quality received, or based on their
profits. Our analysis suggests that feedback systems mapping the feedback request and elicitation
scale on the most diagnostic dimension of seller future performance are most informative.
Keywords: Market design, Recommendation systems, Online ratings, Entropy, Hidden Markov
Mixture of Experts models
1 O.P. Jindal Chair Professor of Management Economics, Jindal School of Business, University of Texas at Dallas
2 Assistant Professor of Marketing, Rotterdam School of Management, Erasmus University
3 University of Cologne
1 Introduction
Markets are increasingly designed, not simply evolved (Roth, 2002). Feedback systems are a designed
feature of many modern markets. These systems provide the markets with information on the past reliability
of trading partners, enabling the tit-for-tat strategies that enforce market trustworthiness. Yet little is known
about how feedback systems should be designed, while there is a good deal of evidence that there is room
for improvement. In this paper, we lay out a normative benchmark, involving simple Bayesian measures,
for gauging the success of a feedback system, and we take a first step towards establishing design principles
that might guide a trading system towards reaching this benchmark.
Figure 1 provides a bird’s eye view of how feedback systems integrate into the market. The elements of
the system in need of design are in the blue boxes: the message space used by traders to report a trade
experience, and the feedback aggregation mechanism that combines reports to obtain an overall rating for an
individual trader. Traders interface with the feedback system in two ways, and these are shown in green
print. Traders provide the feedback information as input to the system. They use the outputted ratings to
make decisions on their willingness-to-trade with potential partners in the market.
Implementing a feedback system requires a number of design choices. What information should be elicited
from traders? What format should this information take? What metrics should be used to aggregate
feedback information into trader ratings? How should the ratings be presented to traders? Casual
empiricism suggests certain norms in current practice. Traders are asked a single open-ended question,
such as “Rate this transaction”. Feedback is typically given on a Likert-type scale with between 3 and 10
points, but binary feedback scales are used as well. Individual seller ratings
are aggregated by averaging, which is then posted to the marketplace as the traders’ overall feedback
ratings. Additional information, some platform specific, is often collected and displayed as well, such as
individual text comments and breakouts of individual score histories, but the average feedback score tends
to be the headline rating, the rating information most prominently displayed and most easily accessed. Our
investigation focuses on the design of the headline rating part of the system. Other aspects of the system
are likely important as well, but given their prominence in virtually all systems, the headline numbers strike
us as the place to start a systematic investigation.
In formulating design principles, it is important to have a clear benchmark for the objective being served.
The objective of a market feedback system is to provide ratings that are informative in two senses. First,
that a trader’s rating represents an accurate picture of that trader’s past performance. A system that fails in
the first regard risks its credibility with traders. Second, that a trader’s ratings have forecast value for that
trader’s future performance. A system that fails in this regard will not be useful to enforce trust in the
marketplace. Note that the second criterion does not necessarily follow from the first. A system might
accurately reflect the past but be suboptimal with regard to informativeness about future trader behavior,
perhaps because what it measures are not the best forecast variables. Alternatively, the feedback system
could fail to provide incentives for trustworthy behavior, such that sellers start “milking” the reputation
they have built previously. Sound principles for the design of these systems, therefore, are principles that
optimize the informativeness of the system by focusing on the important forecast variables.
In the normatively best feedback system, traders would accurately report all relevant aspects of their trading
partner’s performance, the system would aggregate this information as forecast guidance that would then
be used optimally by traders in deciding on their interactions with future trading partners. In practice, there
are two kinds of challenges to achieving such a system. The first is institutional. The need to aggregate
information about a trader across her various interactions requires a certain amount of uniformity in
reporting, meaning constraints on the message space. In addition, there is ambiguity about what constitutes
the best aggregation technique. Design mistakes in either regard might lead to a loss of informativeness.
The second challenge is behavioral. The system relies on traders to report their experiences accurately.
And no matter how informative ratings might be in theory, trader mistakes made in application reduce the
use-value of the system. To be successful, feedback systems must be designed to induce traders to give
accurate and useful feedback.
Figure 1: A bird’s eye view of the integration of feedback systems in the marketplace
[Figure omitted. Trader actions shown: decide willingness-to-trade; submit feedback.]
Previous studies show that traders with favorable feedback scores are more likely to trade and at relatively
favorable terms (for overviews of the literature, see Bajari and Hortacsu 2004 and Tadelis 2016). However,
there is a sizable literature to show that the feedback systems presently used in practice have a number of
problems. Feedback scores are compressed, making inferences about seller behavior more difficult (Gregg
and Scott 2006, Dellarocas and Wood 2008, Bauerly 2009, Bolton et al. forthcoming). The early research
on eBay’s feedback system where both buyers and sellers could rate each other showed that ninety-nine
percent of all feedback left was positive (Resnick et al. 2006, Kauffman and Wood 2006). However, there
is evidence that the share of dissatisfied traders is much larger: Dellarocas and Wood (2008) used structural
estimation and found that 21% of buyers and 14% of sellers had a bad experience in rare coin auctions on
eBay. One reason is that dissatisfied buyers tend to remain silent more often, or even report a positive rating
out of fear of a negative retaliatory rating (Bolton et al. 2013). Silent transactions are not reported in most
systems, although they carry information about the sellers, as they are more likely to be transactions where
something went wrong. Nosko and Tadelis (2015) showed in a large-scale field study on eBay that buyer
satisfaction increases if search results are sorted in a way that also takes into account the number of silent
transactions a seller had.
These issues suggest there are problems with the current feedback systems, either pertaining to how
feedback ratings are elicited or what the resulting information inputted is, or with the way the information
is aggregated and presented to users. Yet at present there is little in the way of a model for how feedback
is given or used to guide us (see Golman and Bhatia 2012 for a model of lenient feedback giving). On the
institutional side, we lack clear benchmarks for establishing the effectiveness of these systems. We do not
know what the optimal rating metrics are or the optimal aggregation technique. On the behavioral side,
which is closely related to the institutional questions, we do not understand how traders interact with these
systems at an elementary level: How do traders determine the rating that they give other traders? How do
they use feedback information to determine their willingness-to-trade and at what price? Absent answers
to these questions, it is hard to see how we can come up with answers to the visible problems mentioned
above.
We study a market stylized to capture an elementary trust problem that feedback systems aim to solve. The
setup is characterized by adverse selection with regard to the seller types. Buyers have incomplete
information about seller reliability. Specifically, sellers differ in the average quality of product they deliver,
although the quality in any transaction is noisy around the average associated with the sending seller type.
Prior to market open, the distribution of seller types is made common knowledge among buyers. After a
transaction is made, a buyer can give a numerical feedback rating to the seller. All the ratings a seller has
previously received are averaged, resulting in a seller feedback score that is shown to the next prospective
buyer matched with that seller. In essence, the feedback system is informative to the extent that the provided
feedback scores aid buyers in discovering a seller’s true market type prior to trading with that seller. The
nature of the quality of product in the market lends itself to a rather canonical mapping into feedback scores.
Assuming that buyers use this mapping, a straightforward Bayesian updating routine determines the
dynamically optimal informativeness of the system across periods of trade. We propose to measure the
informativeness by the entropy of the beliefs about the seller types, a concept adapted from information
theory.
We then implement this market in a laboratory setting to study the system’s actual informativeness. The
experimental market isolates human buyer behavior, ‘fixing’ sellers as robots that play the role of the noisy
sellers, a feature that is known to the buyers. We can then compare buyer behavior, in terms of feedback
giving and use of feedback scores, to the normative benchmarks. We vary two features of the institutional
structure of the feedback system. The first feature is the question used to prime buyers to give feedback.
Typically, the question is open ended, such as “Rate this transaction”, “Rate your experience”, or “Leave a
rating”. Our normative model suggests that these questions should be more directed. The reason is that
scores are more informative if all ratings correspond to a common mapping. So in addition to the baseline
treatment where we ask “Leave a rating”, we run treatments that ask a directed “Rate the quality you
received”. The second feature we manipulate is the scale for scoring. In many systems, it is a five-point
Likert scale (eBay’s DSR scoring was inspired by this, see Bolton et al. 2013). Our normative model
suggests the scale should be tailored to what the priming question directs buyers to rate. For the quality
in the experiment, a more detailed scoring range should be more informative than a less detailed one. Thus,
in addition to a 5-point scale, we consider a 3-point scale; the latter’s informativeness can be close to the
former’s if used with an appropriate mapping from quality to a rating, but decreases sharply otherwise.
The remainder of the paper is organized as follows. In Section 3, we describe the experiment design, the
normative model, and derive benchmarks for informativeness. In Section 4, we show descriptive results
and compare the empirical informativeness of the feedback system in the four treatments to the benchmarks.
In Section 5, we develop a Hidden Markov Mixture of Experts model to get more insight into what features
the subjects base their ratings on to show why informativeness is lower than theoretically possible. Section
6 concludes.
2 Experimental design and formal description of the model
2.1 Description of the experimental design
In our experimental market, human subjects in the role of buyers interact with computerized sellers offering
goods of unknown quality. Sellers differ in their type, that is, the distribution according to which they ship
different levels of quality. Buyers are informed that there are three types of sellers – called Type A (low
performers), Type B (medium performers), and Type C (high performers) - and their respective quality
distributions (see Figure 2). Hence, our markets are characterized by adverse selection.
Figure 2: Computerized sellers privately determine quality sent 𝑄 ∈ {20, 40, 60, 80, 100}. The distributions of quality
sent by seller type show the market accommodates low (A), medium (B), and high (C) performers.
The human buyers engage in repeated auctions to buy the goods offered by robot sellers over a series of
periods. In each period, one buyer is matched with exactly one robot seller and plays the stage game outlined
next (see also Figure 3).
In the auction stage, buyers learn their valuation for the good offered by the seller in the current period. It
is drawn randomly from a uniform distribution of all integers from 200 to 300. We use Experimental
Currency Units (ECU) as monetary unit. Buyers receive feedback information about the matched seller: the
number of transactions and the seller’s average feedback score (equal to the average of all previous
feedback ratings, and empty if the seller did not yet receive any feedback). Buyers are asked to submit their
maximum willingness to pay for the good. We use a BDM mechanism (Becker et al., 1964) to elicit buyers’
true willingness to pay for the product. The price is drawn randomly from a custom distribution of all
integers between 0 and 300. The distribution places most of its mass on low prices to ensure a reasonable
number of transactions (see the next section for the exact specification). Note that buyers’ optimal strategy
is to bid their true willingness to pay, which is their valuation multiplied by the expected quality based on
their belief about the seller type.
[Figure 2 omitted. Panels: Type A, Type B, Type C. Horizontal axis: level of shipped quality (20% to 100%). Vertical axis: probability (ticks at 10, 20, 40).]
Figure 3: The experimental platform: the auction and feedback stages
In the feedback stage, buyers first learn whether the transaction took place. A transaction occurs if the
random price is lower than the bid. If the good is sold, the buyer has to pay the random price, learns the
level of quality the robot seller shipped (20% to 100% quality), and his profit in the current period. Buyers’
profits are equal to their valuation multiplied by the received level of quality minus the random price if
the buyer wins the auction, and 0 otherwise. Second, the buyers are asked to leave a rating. Leaving
feedback is compulsory, to avoid any confounds related to buyer selection in feedback giving. After
submitting a rating, the period ends and the next one begins.
Our treatments are implemented on the feedback page. We use a full factorial design between feedback
scale and feedback elicitation question and thus have four treatments in total. Subjects are asked to leave
general feedback (“Leave a rating”) vs. directed feedback based on quality (“Rate the quality”). Subjects
can leave feedback on a 3-point scale (matched with seller type), or on a 5-point scale (matched with
quality).
The stage game described above is repeated for 60 periods. Since buyers’ bids depend on the feedback
scores given by other players, all game observations are inherently correlated through the feedback
structure. Therefore, in order to have independent observations, we create matching groups of 6 human
players at the beginning of the experiment and consider one matching group to be one independent
observation. Within one matching group there are 6 sellers, two of each of the three types A, B, and C.
Sellers and buyers are randomly matched within matching groups at the beginning of each period. We use
the same random matching routine in all treatments, to reduce noise, such that matching, buyers’ valuation,
random price, seller type and seller quality are constant across matching groups.
At the beginning of the experiment, before the first period starts, subjects are made familiar with the quality
distributions of the different seller types and the BDM mechanism. On a first simulation screen for the
quality distributions, they see three empty graphs, one for each seller type. By clicking a button below each
of the boxes, one level of quality is drawn according to the seller’s distribution and shown as a histogram
in the graph. With each new draw the histogram is updated. They are encouraged to click often until they
see that the empirical distribution might differ at first but eventually converges to the theoretical
distributions outlined in Figure 2. On the second simulation screen, they see the random price distribution
used for the BDM mechanism. They are asked to select a hypothetical maximum price they are willing to
pay and an actual bid. The price distribution graph then is updated to show the fact that bids above the
willingness to pay can lead to transactions with a loss, and bids below the willingness to pay can lead to
foregone profitable trades (see Appendix A for screenshots). We made it transparent to the subjects that the
optimal way to bid is to bid exactly the willingness to pay in order to make the bid a good proxy for expected
quality.
After the main experiment was over, subjects were asked to fill in a socio-demographic questionnaire,
psychometric scales such as risk aversion (the bomb risk elicitation task, Crosetto et al. 2013), numeracy
(the Berlin Numeracy Test, Cokely et al. 2012), and answer a question regarding their feedback giving habit
on online platforms in real-life transactions (1-almost never to 5-almost always). They are also asked to
provide their reasoning on how they gave feedback ratings in the experiment. Once the experiment is over,
their earnings from the main game plus an initial endowment of 200 ECU are converted to euro (320 ECU
= 1 euro). The earnings from the main game, the risk aversion and numeracy tasks and a show-up fee of 4
euro are added and paid out to participants in cash.
We implemented and ran our experiment using SoPHIE, a web-based experimental software. We ran 8
experimental sessions in a physical lab; each session lasted about 90 minutes. One treatment (2 sessions)
was run in February 2018 and the remaining three treatments were run in June 2018, using a subject pool
from a large western European university. Subjects in this pool are accustomed to participating in incentive
aligned experiments involving no deception such as ours. 6 (2) sessions had 30 (24) buyers participate in
the auction against robot sellers, with 5 (4) matching groups per session (the number was lower in two
sessions due to no-shows).
2.2 Description of the formal model
In this section we describe the experiment again in terms of a formal model for the informativeness analysis
in the following section. Figure 4 gives an overview of the model. In each period $t \in \{1, 2, \dots, 60\}$, bidders
are randomly matched with a seller. The seller’s type is drawn from $\Theta = \{\theta_1, \theta_2, \theta_3\}$, i.e., there are good,
average, and bad sellers. There is an equal number of each seller type such that buyers have a uniform prior
$\mu(\theta) = \frac{1}{3}$ over $\Theta$. A seller ships quality from the set $Q = \{1, \dots, 5\}$ denoting the five quality levels between
20% and 100% in increments of 20 percentage points. Each seller ships quality with probabilities
$p_\theta = (p_\theta^1, \dots, p_\theta^5)$. As noted before, the good seller has shipping probabilities $p_{\theta_1} = (0.1, 0.1, 0.2, 0.2, 0.4)$, the
average type ships quality according to $p_{\theta_2} = (0.1, 0.2, 0.4, 0.2, 0.1)$, and the bad type is defined by the
shipping probabilities $p_{\theta_3} = (0.4, 0.2, 0.2, 0.1, 0.1)$. Denote the average quality shipped by each seller type
by $\bar{q}_\theta$.
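As a quick check of the type definitions, the following Python sketch computes the average shipped quality for each seller type from the shipping probabilities above. The dictionary keys and the function name `mean_quality` are illustrative choices and are not taken from the experiment software.

```python
# Quality levels as fractions of full quality (20%, ..., 100%).
QUALITY_LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

# Shipping probabilities copied from the model description.
# The type labels are illustrative names only.
SHIPPING_PROBS = {
    "good":    [0.1, 0.1, 0.2, 0.2, 0.4],
    "average": [0.1, 0.2, 0.4, 0.2, 0.1],
    "bad":     [0.4, 0.2, 0.2, 0.1, 0.1],
}


def mean_quality(probs):
    """Expected shipped quality, as a fraction of full quality."""
    return sum(p * q for p, q in zip(probs, QUALITY_LEVELS))


for seller_type, probs in SHIPPING_PROBS.items():
    print(seller_type, round(mean_quality(probs), 2))
```

Under these probabilities the averages are 74%, 60%, and 46% of full quality for the good, average, and bad types, respectively.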
The platform collects information on the sellers and consists of a set of admissible feedback ratings, a
feedback rating aggregation function, and a feedback information display. In the experiment, ratings are
allowed in the set of integers from 1 to $k$, denoted by $R_k$, and $k$ is either 5 or 3. The feedback aggregation
function $f(r) = \frac{1}{n} \sum_{i=1}^{n} r_i$ is the average of all ratings received for a particular seller up to the current period,
$(r_1, \dots, r_n)$, where $r_i \in R_k$ and $n$ is the number of ratings received. The feedback information display $D_t = (f(r), n)$
consists of the average rating, called feedback score, and the number of ratings received, and is
shown to the bidder in period $t$.
After observing the realization of quality 𝑞 in a given period from a particular seller, buyers submit a
feedback rating 𝑟 to the platform. This rating is determined by a mapping from quality to rating using the
stochastic matrix
$$M_k = \begin{pmatrix} p_{1,1} & \cdots & p_{1,k} \\ \vdots & \ddots & \vdots \\ p_{5,1} & \cdots & p_{5,k} \end{pmatrix}, \qquad (1)$$
where each row is the distribution of ratings conditional on quality 𝑞 ∈ 𝑄. Hence, the mapping from quality
into ratings induces a random variable 𝑅. For each seller, the probability distribution over 𝑅 is given by
$\rho_\theta = (\rho_\theta^1, \dots, \rho_\theta^k)$, where $\rho_\theta^i = \sum_{j=1}^{5} p_\theta^j \, p_{j,i}$ for seller type $\theta \in \Theta$. The mapping $M_k$ also induces a
distribution of the feedback information display. Observe that the elements of $D$ can be multiplied to give
the sum of ratings $z = f(r) \, n$. The distribution of a sum of independent and identically distributed random
variables can be obtained using generating functions (see Wilf 1994). In our setup, we can use the polynomial
$G(x) = \left( \sum_{i=1}^{k} \rho_\theta^i x^i \right)^n$,
where the coefficient on $x^z$ is the probability of observing a sum of $z$ after $n$ ratings.4 Hence we have
$\zeta_{\theta,n} = (\zeta_{\theta,n}^{n}, \zeta_{\theta,n}^{n+1}, \dots, \zeta_{\theta,n}^{nk})$ as the distribution of the sum of ratings for a given seller type $\theta$ after $n$ ratings have
been received.
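The generating-function step can be sketched in a few lines of Python: raising the rating polynomial to the $n$-th power is the same as convolving the single-rating distribution with itself $n$ times. The function names `rating_distribution` and `sum_distribution` are hypothetical, and the example assumes the good seller type together with the identity mapping.

```python
def rating_distribution(shipping_probs, mapping):
    """Distribution of a single rating: rho[i] = sum_j p[j] * M[j][i]."""
    k = len(mapping[0])
    return [sum(p * row[i] for p, row in zip(shipping_probs, mapping))
            for i in range(k)]


def sum_distribution(rho, n):
    """Coefficients of (rho_1 x + ... + rho_k x^k)^n as a dict {z: probability}."""
    coeffs = {0: 1.0}  # the constant polynomial 1: zero ratings sum to 0
    for _ in range(n):
        product = {}
        for z, prob_z in coeffs.items():
            for rating, prob_r in enumerate(rho, start=1):
                product[z + rating] = product.get(z + rating, 0.0) + prob_z * prob_r
        coeffs = product
    return coeffs


# Good seller type with the identity mapping (each quality level reported as is).
good = [0.1, 0.1, 0.2, 0.2, 0.4]
identity = [[1 if i == j else 0 for i in range(5)] for j in range(5)]
rho_good = rating_distribution(good, identity)
zeta = sum_distribution(rho_good, n=2)

print(sorted(zeta))        # the possible sums after two ratings
print(round(zeta[10], 4))  # probability that both ratings equal 5
```

After two ratings the possible sums run from 2 to 10, and the probability of a sum of 10 is 0.4 × 0.4 = 0.16.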
The platform’s choices of message space $R_k$, feedback aggregation function $f(r)$, and feedback
information display $D$, together with the mapping matrix $M_k$ chosen by the buyers, define an information
structure $S_k = (R_k, f(r), D, M_k)$ determining the updating process about the seller types. The information
structure induces the set of observable signals $Z$ after a seller received $n$ ratings together with the associated
probabilities $\zeta_n = (\zeta_n^{n}, \zeta_n^{n+1}, \dots, \zeta_n^{nk})$ over the signals, where $\zeta_n^{i} = \sum_{\theta \in \Theta} \mu(\theta) \, \zeta_{\theta,n}^{i}$.
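In each period, bidders form beliefs $\beta(\theta | Z_n)$ about the seller types based on the feedback information
display provided by the platform. For each seller type $\theta$, the posterior belief after observing a sum of ratings $z$ is given by
$$\beta(\theta | Z_n) = \frac{\mu(\theta) \, \zeta_{\theta,n}^{z}}{\sum_{\theta' \in \Theta} \mu(\theta') \, \zeta_{\theta',n}^{z}}. \qquad (2)$$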
Based on their beliefs, bidders submit an integer-valued bid $b \in [0, 300]$ to a BDM mechanism where the
price $p$ is drawn from a mixture distribution with CDF $F(x) = \sum_{i=1}^{4} w_i P_i(x)$, where $P_1(x) \sim U(0, 60)$,
$P_2(x) \sim U(61, 120)$, $P_3(x) \sim U(121, 200)$, $P_4(x) \sim U(201, 300)$, $w_1 = 0.6$, $w_2 = 0.2$, $w_3 = 0.15$, and $w_4 = 0.05$.
The expected profit of a bidder is $F(b) \left( E_\theta[Q | Z_t] \, v_i - E[p \mid b > p] \right)$, where $v_i$ is the bidder’s valuation
for the good at full quality. The expected quality $E_\theta[Q | Z_t]$ is given by $\sum_{\theta \in \Theta} \beta_\theta(Z_t) \, \bar{q}_\theta$, the belief-weighted
average of the expected qualities for the three seller types.
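The link from beliefs to bids can be illustrated with a small Python sketch. It assumes that each mixture component of the price distribution is uniform over the integers in its range, and that a trade occurs only when the drawn price is strictly below the bid, as described in Section 2.1. The names `price_pmf`, `expected_profit`, and `best_bid` are hypothetical, and the mean qualities are those implied by the shipping probabilities.

```python
# Mean quality by seller type (fraction of full quality), implied by the
# shipping probabilities. The type labels are illustrative names only.
MEAN_QUALITY = {"good": 0.74, "average": 0.60, "bad": 0.46}

# Mixture components of the BDM price distribution: (low, high, weight).
# Assumption: each component is uniform over the integers low..high.
PRICE_COMPONENTS = [(0, 60, 0.6), (61, 120, 0.2), (121, 200, 0.15), (201, 300, 0.05)]


def price_pmf():
    """Probability of each integer price under the mixture."""
    pmf = {}
    for low, high, weight in PRICE_COMPONENTS:
        count = high - low + 1
        for price in range(low, high + 1):
            pmf[price] = weight / count
    return pmf


def expected_quality(beliefs):
    """Belief-weighted average of the mean qualities of the seller types."""
    return sum(beliefs[t] * MEAN_QUALITY[t] for t in beliefs)


def expected_profit(bid, valuation, beliefs, pmf):
    """Expected profit when a trade occurs only if the price is below the bid."""
    worth = valuation * expected_quality(beliefs)
    return sum(prob * (worth - price) for price, prob in pmf.items() if price < bid)


def best_bid(valuation, beliefs):
    """Integer bid in 0..300 that maximizes expected profit (exhaustive search)."""
    pmf = price_pmf()
    return max(range(0, 301),
               key=lambda b: expected_profit(b, valuation, beliefs, pmf))


prior = {"good": 1 / 3, "average": 1 / 3, "bad": 1 / 3}
print(best_bid(252, prior))
```

With a valuation of 252 ECU and the uniform prior, the willingness to pay is about 151.2 ECU, and the search returns a bid of 152, the smallest integer bid that still accepts every price below the willingness to pay.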
4 The probabilities can be computed explicitly as $\frac{1}{z!} G^{(z)}(0)$, where $G^{(z)}(0)$ is the $z$-th derivative of $G$ at $x = 0$.
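Figure 4. Overview of the formal model
[Figure omitted. Elements shown, grouped by platform, buyers, and seller: message space $R_k$, feedback aggregation rule $f(r)$, feedback information display $D$, mapping $M_k$, belief updating $\beta_\theta(D)$, seller types $\Theta$, bid $b$, quality $q$, rating $r$, feedback score.]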
3 Analysis
3.1 Informative feedback mappings
In formulating design principles, it is important to have a clear benchmark for the objective being served.
The objective of a market feedback system is to provide ratings that are informative, in the sense that a
trader’s ratings have forecast value for that trader’s future performance. We investigate the informativeness
of our four feedback systems by testing the extent to which a platform can predict seller types given the sellers’
feedback scores.
In this section, we describe the informative feedback mapping from quality to ratings for our two rating
systems $R_3$ and $R_5$. We use concepts from information theory to measure the informativeness of the rating
systems about the seller types (see Cover and Thomas, 2006, for an overview of information theory and
Veldkamp, 2011, for an overview of how informativeness is measured in economics and finance). The
uncertainty of a random variable $X$ with probability mass function $p(x)$ is measured by its entropy
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$.5 Entropy is 0 if $p(x_i) = 1$ for exactly one outcome $i$ and $p(x_j) = 0$ for all other
outcomes, and it is at its maximum when all $p(x)$ are equal. In our setup, the entropy of the prior beliefs
about the seller types is $H(\Theta) = -\sum_{\theta \in \Theta} \mu(\theta) \log_2 \mu(\theta) \approx 1.58$, which is the maximum entropy of a random
variable with three outcomes. When the beliefs are updated due to new information about the seller’s type
in the form of the feedback score, the entropy decreases in expectation. In this sense, when beliefs about the seller type
converge to the true type, entropy goes to 0.
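The two boundary cases mentioned above can be reproduced with a minimal Python sketch of the entropy formula. The function name `entropy` is an illustrative choice; the convention $0 \log 0 = 0$ is implemented by skipping zero-probability outcomes.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


print(round(entropy([1 / 3, 1 / 3, 1 / 3]), 2))  # uniform prior over three seller types
print(entropy([1.0, 0.0, 0.0]))                  # seller type known with certainty
```

The uniform prior gives $\log_2 3 \approx 1.58$ bits, and a degenerate belief gives 0 bits.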
In information theory, the concept of relative entropy, or the Kullback-Leibler distance, is used to measure
the difference between two probability distributions $p(x)$ and $q(x)$ of the same random variable $X$.6 It is
defined as $d(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)}$. Relative entropy “is a measure of the inefficiency of assuming
that the distribution is $q$ when the true distribution is $p$” (Cover and Thomas, 2006, p. 19). We use the
expected relative entropy over the possible signals
$$E_{Z|S_k}\left[ d\left( \beta(\theta | Z_t) \,\|\, \mu(\theta) \right) \right] = \sum_{z \in Z|S_k} \zeta_n^{z} \sum_{\theta \in \Theta} \beta(\theta | z) \log_2 \frac{\beta(\theta | z)}{\mu(\theta)} \qquad (3)$$
to measure the expected informativeness of an information structure $S_k$, i.e., the expected improvement of
using the updated beliefs after observing the feedback score compared to using the prior beliefs about the
seller type. The expected relative entropy is also called the mutual information between the signal and the
seller type.
We compute the expected relative entropy explicitly for some mapping matrices as an informativeness
benchmark. We consider only mappings where the rating is weakly increasing in quality. Figure 5 shows
the expected relative entropy conditional on the number of feedback ratings received. For the rating system
$R_5$, the identity matrix $M_5 = I_5$ is an optimal mapping, as each quality level is submitted to the
platform without noise.7 In the rating system $R_3$ this is no longer possible as there are fewer ratings available
than there are quality levels. We benchmark the following mappings:8
5 By convention, $0 \log 0 = 0$, which is justified by a continuity argument as $x \log x \to 0$ as $x \to 0$. It is common to use the base 2 logarithm such that entropy is measured in bits. Other bases would only change the unit of measurement.
6 Again, by convention, $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{q} = 0$, and $p \log \frac{p}{0} = \infty$. Using the natural logarithm, relative entropy corresponds to the expected value of the logarithm of the likelihood ratio.
7 The column-reversed identity matrix with 1 on the counterdiagonal and 0 otherwise would also be an optimally informative mapping in our setup.
8 Due to the symmetry in our sellers, the column-reversed mappings of these matrices lead to the same expected relative entropies.
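$$M_3^{2} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
M_3^{3} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad
M_3^{�} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad
M_3^{�} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$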
As can be seen in Figure 5, the matrix $M_3^{2}$ has the largest expected relative entropy of the mappings under
consideration for the numbers of received ratings that can be observed in our experiment. Hence, we consider
the identity matrix $M_5$ and the matrix $M_3^{2}$ as the optimal benchmark mappings for the analysis of feedback
system informativeness in Section 3.3.
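The benchmark computation can be sketched in Python by combining the sum-of-ratings distribution with the expected relative entropy in equation (3). This is an illustration under the stated model, not the authors' code: all function and constant names are hypothetical, and only two mappings are included, the 5-point identity mapping and the 3-point mapping that pools the three middle quality levels into rating 2.

```python
import math

# Shipping probabilities by seller type (good, average, bad) and the uniform prior.
SHIPPING_PROBS = [
    [0.1, 0.1, 0.2, 0.2, 0.4],
    [0.1, 0.2, 0.4, 0.2, 0.1],
    [0.4, 0.2, 0.2, 0.1, 0.1],
]
PRIOR = [1 / 3, 1 / 3, 1 / 3]

# 5-point identity mapping, and the 3-point mapping sending quality 1 to rating 1,
# qualities 2 to 4 to rating 2, and quality 5 to rating 3.
IDENTITY_5 = [[1 if i == j else 0 for i in range(5)] for j in range(5)]
POOL_MIDDLE_3 = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]


def rating_distribution(shipping_probs, mapping):
    """Distribution of a single rating: rho[i] = sum_j p[j] * M[j][i]."""
    k = len(mapping[0])
    return [sum(p * row[i] for p, row in zip(shipping_probs, mapping))
            for i in range(k)]


def sum_distribution(rho, n):
    """Distribution of the sum of n ratings, via repeated convolution."""
    coeffs = {0: 1.0}
    for _ in range(n):
        product = {}
        for z, prob_z in coeffs.items():
            for rating, prob_r in enumerate(rho, start=1):
                product[z + rating] = product.get(z + rating, 0.0) + prob_z * prob_r
        coeffs = product
    return coeffs


def expected_relative_entropy(mapping, n):
    """Mutual information (bits) between seller type and the sum of n ratings."""
    per_type = [sum_distribution(rating_distribution(p, mapping), n)
                for p in SHIPPING_PROBS]
    total = 0.0
    for z in set().union(*per_type):
        joint = [mu * dist.get(z, 0.0) for mu, dist in zip(PRIOR, per_type)]
        prob_z = sum(joint)
        for mu, joint_z in zip(PRIOR, joint):
            if joint_z > 0:
                posterior = joint_z / prob_z
                total += joint_z * math.log2(posterior / mu)
    return total


for n in (1, 5, 20):
    print(n,
          round(expected_relative_entropy(IDENTITY_5, n), 3),
          round(expected_relative_entropy(POOL_MIDDLE_3, n), 3))
```

Each printed row gives the number of ratings followed by the expected relative entropy under the two mappings. The values are bounded above by the prior entropy of $\log_2 3$ bits, which would be reached only if the feedback score revealed the seller type with certainty.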
Figure 5: Expected relative entropy for the weakly increasing mappings described in the text, conditional
on the number of feedback ratings received.
3.2 Descriptive summary and data patterns
Table 1 gives an overview of the results in our four treatments. The transaction rates and average earnings
are higher in the treatments with the $R_5$ rating system. Within each rating scale, transaction rates and earnings
are higher in the treatment with the unspecific feedback elicitation question.
Treatment                                Observations  Subjects  Transaction rate  Average earnings
3-point scale, rate quality (R3-Q)       3600          60        70.2%             19.62€
3-point scale, rate transaction (R3-U)   3240          54        72.5%             19.95€
5-point scale, rate quality (R5-Q)       3600          60        73.1%             20.12€
5-point scale, rate transaction (R5-U)   3240          54        75.2%             20.17€
Table 1: Summary statistics across experimental treatments.
Figure 6a: The evolution of feedback scores for each of the 6 sellers across the 60 periods, and the 6 matching groups
(dashed lines). Feedback scores were normalized on the quality scale [20%, 100%], to be comparable across the 4
treatments. The golden line represents seller’s expected quality given the theoretical distribution presented at the
beginning of the experiment, and the thick grey line represents feedback score (the average across all feedback ratings).
6b: Feedback score distributions at the last period across the four treatments.
Figure 6a shows the average feedback scores of the six sellers across the 60 periods. The “5-point, rate
quality” treatment exhibits the least variation in feedback scores, allowing subjects to make more precise
inferences about the seller types, the key to solving the adverse selection problem inherent in this market.
Figure 6b suggests that at period 60, subjects are best able to distinguish between the three types of sellers
in the “5-point, rate quality” treatment, as the feedback scores are more separated.
Figure 7: Rating distribution conditional on quality: feedback ratings given by buyers after receiving a good with a
certain quality level.
Going back to the question presented in the introduction, although the feedback score seems to closely
follow quality received, there is not an exact one-to-one mapping between quality received and the feedback
rating given by buyers (see Figure 7). In the treatments with the $R_5$ rating system, the most common rating
for each level of quality is indeed the one prescribed by the matrix $M_5$, but there are considerable deviations,
more so in the R5-U treatment. In the treatments where there are only three rating levels available, the
ratings for 20%, 60%, and 100% of quality are most often 1, 2, and 3, respectively, while there is again
more variance in the treatment with the general feedback elicitation question. Quality of 80% is most often
rated with a 3, although there is a considerable number of ratings of 2. The largest difference between the
two R3 treatments occurs at a quality level of 40%: In R3-Q, the most frequent rating is a 2, while it is a
3 in the R3-U treatment. The mapping in R3-U resembles the matrix $M_3^{3}$, while the mapping in R3-Q
is more of a mix between the matrices $M_3^{3}$ and $M_3^{�}$.
This raises the question of what determines buyers’ feedback-giving behavior, which we tackle
in Section 4.
3.3 Feedback system informativeness
We use two approaches to compare the informativeness of the ratings in the four treatments. The first
approach is based on the expected relative entropy given the empirical mappings from the four treatments
(see Figure 8). This approach is tied closely to the theoretical analysis above. We compute for each
treatment and for each possible number of ratings the expected relative entropy for the mapping matrices
the subjects used in the experiment. Two observations can be made. First, within the R5 and the R3
treatments, the mappings used in the treatments where the feedback elicitation question refers to quality
exhibit higher informativeness. Second, informativeness is lower in the R3 than in the R5
treatments.
Figure 8: Expected relative entropy for the optimally informative mapping and the mappings used by subjects in the
experiment, conditional on the number of feedback ratings received.
The second approach is based on a multinomial logit model in which the probability of each seller type is
predicted by the seller’s feedback score, controlling for the number of ratings the seller has received, the game
period, and an interaction between the period and the feedback score. To characterize the amount of
information loss induced by each feedback system, we compute the relative entropy of the predicted
probabilities for the three seller types.
At every period $t$, we compute the relative entropy per treatment and per transaction opportunity $i$, where
$y_i \in \{A, B, C\}$ is the seller type at transaction opportunity $i$, and $x_i$ collects the independent variables
mentioned above, including an intercept. The data include the entire history of seller feedback scores up to
period $t$. We then average the relative entropy across transactions to obtain an aggregate measure per period,
which we plot in Figure 9 below. Intuitively, the larger the relative entropy, the easier it is to distinguish
between the three seller types based on their feedback scores, making the system more informative. Both
approaches yield the same informativeness ranking of the four treatments.
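As a rough illustration of the second approach, the sketch below computes the relative entropy (Kullback-Leibler divergence) of predicted seller-type probabilities with respect to a prior and averages it across transaction opportunities. The uniform prior over the three types, the example probabilities and the function names are assumptions made for illustration; they are not the estimates or the code used in the paper.

```python
import math

def relative_entropy(posterior, prior):
    """Kullback-Leibler divergence D(posterior || prior), measured in bits."""
    return sum(p * math.log2(p / q) for p, q in zip(posterior, prior) if p > 0)

def mean_relative_entropy(predicted, prior):
    """Average relative entropy across the transaction opportunities of one period."""
    return sum(relative_entropy(p, prior) for p in predicted) / len(predicted)

# Hypothetical prior over seller types A, B, C (assumed uniform for illustration).
prior = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical predicted type probabilities for two transaction opportunities.
sharp = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]  # scores separate the types well
vague = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30]]  # scores barely move beliefs

print(mean_relative_entropy(sharp, prior))
print(mean_relative_entropy(vague, prior))
```

Under these assumed numbers the sharper predictions yield the larger average relative entropy, which is the sense in which a larger value indicates a more informative feedback system.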
Figure 9: Relative entropy per treatment
The most informative system, i.e. the system leading to the largest relative entropy, is the “5-point, rate quality”
treatment. This supports our suggestion to design a feedback system with a clear mapping between the
feedback rating scale and the feedback elicitation question, linked to the dimension deemed most
informative for the feedback system, which in our case is quality. Remarkably, the feedback scores in the
R3 – Q treatment are the second-most informative. We see that as long as the feedback elicitation is directed
towards the informative signal, quality in our setup, a coarser scale is not problematic per se. However, if
the feedback elicitation question is unspecific, informativeness drops in both rating systems.
Still, as seen in Figure 8, the informativeness in all four treatments is lower than theoretically possible. The
reason is the noisy feedback giving that does not follow the optimal mappings outlined in the section above.
In the next section, we analyze the determinants of buyers’ feedback giving to better understand why and
how feedback giving diverges from the most informative behavior.
4 Understanding the drivers of buyers’ behavior
4.1 A hidden Markov mixture of experts model to unravel buyers’ unobserved feedback giving
rules
Feedback systems are believed to function in an equilibrium state where all players coordinate on the
feedback rule. But equilibrium is the end state reached via an iterative process that unfolds over time. We
aim to uncover the underlying feedback rules used by the players, the evolution of rule use, and the degree
of state dependence. Prior studies have focused on the drivers of feedback giving rather than on the rules buyers
use to leave feedback (Gutt et al. 2018). The idea of rule switching has been documented in learning
behavior (Salmon 2004, Stahl 2001, Ansari et al. 2012).
The descriptives presented in the above section suggest that quality is a major determinant of the level of
feedback rating. However, other factors also seem to play a role in feedback giving. In the feedback stage
of our experiment, buyers are informed about their profit from the transaction, in addition to quality
received, the seller feedback score and the transaction outcome. Therefore, subjects might use a profit-
based rule to leave feedback. For the same level of quality received, players might give a lower rating if
their profit is low or even negative, or a higher rating if their profit is large. Given that our experiment
involved multiple periods, any attempt to model feedback giving rules will need to account for the
possibility that subjects switch between feedback rules throughout the 60 periods of the experiment,
assuming multiple feedback rules exist. The dynamics in feedback rule adoption can depend on the
feedback system design, and on players’ experience in previous rounds.
The feedback rule used by a player in a certain round is not observable. We can only observe the rating
players leave. We implement a non-homogeneous hidden Markov mixture of experts model (NH-HMME)
to reveal players’ transitions between leaving feedback based on profit and leaving feedback based on
quality received over the periods of the game. In our context, the “experts” are the two feedback rules: the
quality rule and the profit rule. Players leave ratings that are conditional on, and thus probabilistically
aligned with, the feedback rule chosen in each period. The approach is popular in machine learning and has been
introduced in economics to study the dynamics of learning rules: Ansari et al. (2012) studied how subjects switch
between reinforcement and belief learning in six games. A hidden Markov model (HMM) is more
commonly used to unravel latent dynamics in behavior. Instead of considering that the choice of rating is
driven by the profit rule or the quality rule, an HMM would assume that the conditional ratings distributions
come from the same model, i.e. the same family of distributions, but with different parameters. In our setup,
such a model is not identifiable because quality and profit are highly correlated (0.864 across all treatments),
which would lead to a highly collinear model were we to introduce them as explanatory variables in the
same model. Therefore, we propose a model where the two are mutually exclusive at each period, and
ratings can be consistent with a quality-based feedback rule or profit-based feedback rule. Subjects can
switch between feedback rules after every period of the auction, following a first-order Markov
process governed by a full transition matrix.
The non-homogeneous nature of the model allows for time-varying transition propensities between
feedback rules, likely influenced by experimental outcomes. For instance, disappointing outcomes, such as
a loss (negative profit) in the previous period of the auction, can motivate subjects to switch states and
change their feedback-giving rule.
Figure 10: The hidden Markov mixture of experts model of feedback giving rules
Model specification: The HMME model has three components:
1. Start probabilities: the initial state membership probability, $\pi_i$, giving player $i$’s probability of
leaving feedback based on the profit vs. the quality rule $s$ in period 1.
2. (Non-homogeneous) transition probabilities $\Lambda_{it}$: the likelihood of transitioning between feedback
rules $s$ in period $t$. These probabilities are time-varying, i.e. non-homogeneous, and depend on the
game outcomes in the previous period $t-1$.
3. Emission probabilities $P_{it}(Y_{it} = j \mid S_{it} = s)$: the feedback rating choice probabilities conditional on
the state (feedback rule) chosen by the subject.
Initial state probabilities: Let $s \in \{Q, P\}$ denote a latent feedback rule state, where $s = Q$ if the subject gives
feedback according to the quality rule, and $s = P$ if the subject uses the profit rule. The probability that a
player is initially in state $s$, i.e. the initial state membership, is given by the following simplex:
$$\pi = [\pi_Q, \pi_P], \quad \text{s.t. } \sum_{s} \pi_s = 1; \quad \pi_Q = \frac{1}{1 + \exp(-\gamma)} \tag{4}$$
We reparametrize the start probabilities using an inverse-logit transformation.
Transition probabilities: The transition probabilities describe how subjects move between feedback states
at period $t$:
$$\Lambda_{it} = \begin{bmatrix} \lambda_{itQQ} & \lambda_{itQP} = 1 - \lambda_{itQQ} \\ \lambda_{itPQ} = 1 - \lambda_{itPP} & \lambda_{itPP} \end{bmatrix} \tag{5}$$
where $\sum_{s} \lambda_{its's} = 1$. Each element $\lambda_{its's}$ of the matrix gives the probability that a player transitions from
feedback rule $s'$ at period $t-1$ to feedback rule $s$ in period $t$. A player’s probability of transitioning from one
state to another is affected by observed outcomes, and by the player’s intrinsic tendency to transition, captured
by the intercepts in the equations below:
$$\lambda_{itQQ} = \frac{\exp(\rho_Q + \tau_Q x_{it})}{1 + \exp(\rho_Q + \tau_Q x_{it})}; \quad \lambda_{itPP} = \frac{\exp(\rho_P + \tau_P x_{it})}{1 + \exp(\rho_P + \tau_P x_{it})} \tag{6}$$
where $\rho_s$ represents a player’s tendency to remain in state $s$ from one period to the next, and $\tau_s$
captures the impact of game outcomes on the propensity to keep using the same rule. $x_{it}$ are the time-varying
covariates likely to affect switching between rules. We explore different game outcomes that could affect
the transition matrix, which we describe in the empirical estimation section. Note that most outcomes are
revealed in the same period $t$, before the feedback elicitation question, and thus can affect the choice of
feedback rule in that period.
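The following sketch shows how a covariate-dependent transition matrix of this kind can be built, with each staying probability an inverse-logit function of an intercept and a covariate effect. It assumes a single scalar covariate (the seller’s feedback score); the parameter values and function names are hypothetical and chosen only so that both states are sticky.

```python
import math

def inv_logit(z):
    """Inverse-logit transform, equal to exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def transition_matrix(rho_q, tau_q, rho_p, tau_p, x):
    """2x2 row-stochastic matrix over the states (Q, P) for a scalar covariate x."""
    stay_q = inv_logit(rho_q + tau_q * x)  # probability of staying with the quality rule
    stay_p = inv_logit(rho_p + tau_p * x)  # probability of staying with the profit rule
    return [[stay_q, 1.0 - stay_q],
            [1.0 - stay_p, stay_p]]

# Hypothetical parameters, evaluated at low, average and high feedback scores.
for feedback_score in (2.0, 3.0, 4.0):
    matrix = transition_matrix(4.0, 0.5, 3.0, 0.2, feedback_score)
    print(feedback_score, [[round(v, 4) for v in row] for row in matrix])
```

Because each row is defined by a single staying probability, the rows sum to one by construction and no further constraint is needed during estimation.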
Emission probabilities: Conditional on using a certain rule $s$ at period $t$, the probability of rating $j$ is given
by a multinomial logit specification. Subjects can leave feedback on a three-point scale, $j \in \{1, 2, 3\}$, in two
treatments, and on a five-point scale, $j \in \{1, \ldots, 5\}$, in the remaining two treatments. A subject $i$ leaves a rating
$j$ at time $t$ with probability $P_{it}(Y_{it} = j \mid S_{it} = s)$, conditional on the feedback rule $s$ used:
$$P_{it}(Y_{it} = j \mid S_{it} = s) = \frac{\exp(\beta_{js}' K_{its})}{\sum_{k=1}^{J} \exp(\beta_{ks}' K_{its})} \tag{7}$$
For identification purposes, we set all $\beta_{1s} = 0$, and interpret the coefficients as log-odds relative to the rating $j = 1$.
The time-varying set of explanatory variables $K_{its}$ differs by state $s$. The experts in our HMME model
are the two feedback-giving rules. If a player gives feedback according to the quality rule, then $K_{itQ} =
[1, Q_{it}/20, \mathit{Feedbackscore}_{it}]$. If a player gives feedback according to the profit rule, their feedback is
explained by their profit at period $t$, such that $K_{itP} = [1, \Pi_{it}/10, \mathit{Feedbackscore}_{it}]$. We divide quality by
20 to match the feedback and feedback score range, and profit by 10 to ensure better sampling of the
parameters reflecting the impact of profit on the feedback given.
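A minimal sketch of the state-specific multinomial logit is given below. The covariate vectors follow the scaling described above (quality divided by 20, profit divided by 10), while the coefficient values and function names are hypothetical and serve only to illustrate the mechanics; the negative feedback-score coefficients under the quality rule are an assumed sign, not reported estimates.

```python
import math

def rating_probabilities(betas, covariates):
    """Multinomial logit over ratings 1..J; betas[0] is the baseline row of zeros."""
    utilities = [sum(b * k for b, k in zip(beta_j, covariates)) for beta_j in betas]
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def quality_covariates(quality_pct, feedback_score):
    """Covariates under the quality rule: intercept, quality / 20, feedback score."""
    return [1.0, quality_pct / 20.0, feedback_score]

def profit_covariates(profit, feedback_score):
    """Covariates under the profit rule: intercept, profit / 10, feedback score."""
    return [1.0, profit / 10.0, feedback_score]

# Hypothetical coefficients for a 3-point scale under the quality rule.
betas_q = [[0.0, 0.0, 0.0],    # rating 1: baseline, fixed at zero for identification
           [-4.0, 2.0, -0.2],  # rating 2
           [-9.0, 3.5, -0.3]]  # rating 3

for quality in (20, 60, 100):
    probs = rating_probabilities(betas_q, quality_covariates(quality, 2.0))
    print(quality, [round(p, 3) for p in probs])
```

The same function serves both experts: only the covariate vector and the coefficient set change between the quality rule and the profit rule.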
Likelihood: The likelihood of observing a sequence of feedback rules and conditional ratings over
the $T$ periods of the experiment, written in matrix form, is:
$$L_{iT}[S_i(1), \ldots, S_i(T)] = \boldsymbol{\pi}_s' \, \mathbf{Pr}_{is1} \prod_{t=2}^{T} \boldsymbol{\Lambda}_{it} \, \mathbf{Pr}_{ist} \, \mathbf{1} \tag{8}$$
where $\boldsymbol{\pi}_s'$ is given by equation 4, $\boldsymbol{\Lambda}_{it}$ is the transition matrix in equation 5, $\mathbf{Pr}_{ist}$ is a diagonal matrix with
state-specific probabilities given by equation 7, and $\mathbf{1}$ is a vector of ones.
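The matrix product in the likelihood can be evaluated recursively with the forward algorithm. The sketch below does this for one subject and two latent states, rescaling at each period to avoid numerical underflow. It is an illustration under assumed inputs (the start, transition and emission values are made up), not the estimation code used in the paper.

```python
import math

def start_probabilities(gamma):
    """Inverse-logit start probabilities [pi_Q, pi_P] from an unconstrained gamma."""
    pi_q = 1.0 / (1.0 + math.exp(-gamma))
    return [pi_q, 1.0 - pi_q]

def log_likelihood(start, transitions, emissions):
    """Log-likelihood of one subject's ratings, summing over latent rule paths.

    start       -- [pi_Q, pi_P]
    transitions -- transitions[t - 1] is the 2x2 matrix moving from index t - 1 to t
    emissions   -- emissions[t][s]: probability of the observed rating at index t under rule s
    """
    log_lik = 0.0
    alpha = [start[s] * emissions[0][s] for s in range(2)]
    for t in range(len(emissions)):
        if t > 0:
            lam = transitions[t - 1]
            alpha = [sum(alpha[r] * lam[r][s] for r in range(2)) * emissions[t][s]
                     for s in range(2)]
        scale = sum(alpha)  # rescale each period to avoid numerical underflow
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

# Hypothetical two-period example (all numbers are illustrative, not estimates).
start = start_probabilities(0.0)
transitions = [[[0.9, 0.1], [0.4, 0.6]]]
emissions = [[0.4, 0.2], [0.5, 0.3]]
print(math.exp(log_likelihood(start, transitions, emissions)))
```

Summing the logs of the per-period scaling constants gives the same value as the logarithm of the unscaled matrix product, which is why the recursion can run over all 60 periods without underflow.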
Model selection: We estimate our model using Hamiltonian Monte Carlo simulation (Gelman et al. 2014)
in a hierarchical Bayesian framework, and implement forward filtering techniques following the likelihood
function in equation 8. We use measures of model fit and predictive accuracy to establish whether the
proposed NH-HMME is the best-suited model despite its complexity.⁹ In Table 2, we report the log-predictive
density (LPD) and the Watanabe-Akaike Information Criterion (WAIC), which both account for model fit
and model complexity.¹⁰ We also compute the mean squared error (MSE) and hit rates.¹¹
                  Log-predictive density                 WAIC
Model             R3U      R3Q      R5U      R5Q         R3U      R3Q      R5U      R5Q
Quality model     -1,270   -966     -2,353   -1,985      2,537    1,924    4,699    3,928
Profit model      -1,508   -1,344   -2,613   -2,604      3,019    2,694    5,230    5,290
HMME              -1,070   -926     -2,380   -2,131      2,053    1,769    5,570    5,509
NH-HMME           -1,068   -922     -1,963   -1,687      2,048    1,749    3,644    2,979

                  MSE                                    Hit rates (%)
Model             R3U      R3Q      R5U      R5Q         R3U      R3Q      R5U      R5Q
Quality model     0.0988   0.0505   0.1478   0.1478      68.59    77.49    61.55    61.48
Profit model      0.1537   0.1102   0.3245   0.2991      60.78    66.79    43.05    45.28
HMME              0.0635   0.0409   0.0909   0.0407      74.80    79.77    70.00    79.86
NH-HMME           0.0645   0.0410   0.1763   0.1120      74.59    79.76    58.06    66.54
Table 2: Model fit and predictive accuracy. Note that the HMME model does not converge well for the 5-star
treatments, further supporting the need for a non-homogeneous model.
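For readers unfamiliar with these fit measures, the sketch below computes a log-predictive density (as the posterior mean of the total log-likelihood) and WAIC from a matrix of pointwise log-likelihoods, one row per posterior draw. It uses the standard WAIC definition with the variance-based complexity penalty; the input values are made up, and the sketch does not reproduce the decoding choices behind the figures in Table 2.

```python
import math
from statistics import mean, variance

def log_predictive_density(log_lik):
    """Posterior mean of the total log-likelihood; log_lik[d][i] is draw d, observation i."""
    return mean([sum(draw) for draw in log_lik])

def waic(log_lik):
    """WAIC = -2 * (lppd - p_waic), using the variance-based penalty."""
    n_obs = len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd += math.log(mean([math.exp(v) for v in column]))  # log of mean likelihood
        p_waic += variance(column)                             # sample variance across draws
    return -2.0 * (lppd - p_waic)

# Hypothetical pointwise log-likelihoods: 3 posterior draws, 3 observations.
log_lik = [[-0.5, -1.2, -0.8],
           [-0.6, -1.0, -0.9],
           [-0.4, -1.1, -0.7]]

print(log_predictive_density(log_lik))
print(waic(log_lik))
```

Lower WAIC indicates better expected out-of-sample fit, whereas a higher (less negative) log-predictive density indicates better fit, which is how the two panels of Table 2 should be read.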
9 In Appendix B we provide the results of a simulation study, showing model parameter recovery and hidden state estimation stability.
10 The log predictive density (Gelman et al. 2014) is computed as the posterior mean of the log-likelihood function evaluated at each draw of the HMC sampler. Note that LPD is based on the forward-filtering probabilities, while WAIC is based on Viterbi decoding.
11 For each subject in each period, the squared error is computed as the square of the difference between the predicted rating probability and the actual behavior (j = {1,2,3} or j = {1 to 5}). The mean is computed across subjects and periods. Based on the above predicted probabilities, we draw a rating j and compute the percentage of correctly predicted ratings at every iteration of the HMC. We report the average hit rates across iterations. We expect hit rates to be lower in the “5 star” treatments than in the “3 star” treatments, since it is harder to predict a choice among 5 options than among 3.
Results show that the proposed NH-HMME model outperforms a series of nested models: a model
where feedback is explained by the quality rule only, a model where feedback is explained by the profit rule
only,¹² and an HMME where transition matrices are not time-varying. The likelihood-based measures
(log-predictive densities and WAIC) strongly support the NH-HMME model, while the evidence is mixed for the
non-likelihood-based measures (hit rates and mean squared errors). Although the non-likelihood-based
measures seem to support the HMME model, that model does not converge well on the “5-point”
treatment data (the Gelman-Rubin statistic $\hat{R} > 1.1$). We therefore conclude that the non-homogeneous
model is most appropriate in this setup.
Initial state probabilities: We report here the substantive results of the NH-HMME model that
highlight players’ feedback rule dynamics, and report all supporting results, including all parameter
estimates and their 95% highest density intervals, in Appendix C. The experimental treatments can affect
buyers’ likelihood of initially giving feedback based on the profit vs. the quality rule. Before they
start the auction, players are instructed to leave feedback on either a 3-point or a 5-point scale, and to
rate either the quality (directed feedback) or the transaction (general feedback).
The initial state probabilities reported in Table 3 show that, in all treatments, subjects are more likely to
start by giving feedback based on quality received. When asked to rate the transaction (general feedback),
subjects are more likely to initially use the profit rule (0.284 and 0.362 for the 3-point and the 5-point
scale treatments, respectively) than when asked to rate the quality received (0.190 and 0.194 for the 3-point
and the 5-point scale treatments, respectively). Interestingly, subjects are more likely to start the game
using the profit rule in the “5-point, rate transaction” treatment than in the “3-point, rate transaction”
treatment, even though in the “5-point, rate transaction” treatment the quality scale matches the feedback
scale, and a fluency argument would suggest that it is easier for subjects to
rate quality received. As expected, the overall probability of initially using the quality rule is highest in the
directed feedback treatments, where subjects are asked to rate the quality received.
State transitions: To illustrate the average feedback rule dynamics, and the impact of game outcomes on
the probability to transition from feedback rule s to feedback rule s’, we report in Table 3 the transition
matrices at high, average and low levels of the feedback score.
To test whether game outcomes impact state transitions, we considered the impact of several variables on
the transition probabilities. We considered the effect of whether the buyer made a loss in the current
12 We also estimated a hidden Markov model (HMM) where quality and profit are both integrated in the emission equations, and their coefficients are allowed to vary by state. As explained, due to the high correlation between quality and profit, the model is not well identified, and the Gelman-Rubin statistic shows that the model does not converge ($\hat{R} > 1.1$). The results are available upon request.
period,¹³ buyers’ expected payoffs, expected quality, the difference between actual and expected payoffs,
and the difference between expected quality and quality received. Game outcomes have a limited but
significant impact on transitions between rules. The best-fitting model is one in which the seller’s feedback
score, which is available to buyers before the transaction occurs, explains state transitions, although the effects are
rather small (see Appendix C). Note that we also control for feedback score in the emission probability model.
This suggests that feedback score has a long-term impact on buyers’ behavior, affecting their transitions
among feedback rules, as well as a short-term effect, affecting the ratings given, conditional on the chosen
feedback rule in any period.
Looking at Table 3, the high values of the diagonal elements in the average transition matrices, most above
90%, suggest that states are very sticky, and players are likely to repeat the feedback rule used in previous
periods of the game. Players rarely transition from the quality to the profit rule, but are slightly more likely
to revert to the quality rule if they previously used the profit rule as a basis for their ratings. Players are
most likely to switch from the profit to the quality rule in the directed feedback treatments, particularly in
the “5-point, rate quality” condition (0.081 vs. approx. 0.03 in all other treatments).
Across all treatments, there seems to be higher variability in feedback rule dynamics when feedback
scores are low, compared to average or high feedback scores.
Transition matrix
                                        Baseline: average    Low feedback         High feedback        Average start
                                        feedback score       score                score                probabilities
Condition                  (t-1) to t   Q        P           Q        P           Q        P
3-point, rate transaction  Q            0.9996   0.0004      0.9964   0.0036      1        0           0.716
                           P            0.0333   0.9667      0.0494   0.9506      0.0224   0.9776      0.284
3-point, rate quality      Q            0.997    0.003       0.9818   0.0182      0.9953   0.0047      0.81
                           P            0.0258   0.9742      0.1073   0.8927      0.0372   0.9628      0.19
5-point, rate transaction  Q            0.9947   0.0053      0.9843   0.0157      0.9982   0.0018      0.638
                           P            0.0245   0.9755      0.0584   0.9416      0.0101   0.9899      0.362
5-point, rate quality      Q            0.9983   0.0017      0.9912   0.0088      0.9997   0.0003      0.806
                           P            0.0809   0.9191      0.095    0.905       0.0687   0.9313      0.194
Table 3: Average start probabilities, transition matrices, and changes in transition probabilities as a function of sellers’
feedback score, across treatments. States seem particularly sticky. Note: Feedback scores used to compute the
transition matrices are: FS={low=1.2, average=2, high=2.8} for the 3-point scale treatments, and FS={low=2,
average=3, high=4} for the 5-point scale treatments.
13 Since players are informed about the seller’s feedback score, the quality received and their profits before they are asked to leave a rating, current profits might influence players to switch between feedback rules.
Figure 11: Population level smoothed state probabilities across the 60 periods of the game. The plot shows population
level feedback rule dynamics. Note that transition matrices are time-varying, depending on feedback score.
In Figure 11, we plot the population-level smoothed state probabilities across the 60 periods of the game.
Transition matrices are allowed to vary over time as a function of sellers’ feedback score. The plot
highlights the need for a time-varying, i.e. non-homogeneous, model to account for the impact of game
outcomes. As seen in Figure 11, there are substantial differences in feedback rule dynamics across treatments:
the “5-point, rate quality” and the “3-point, rate transaction” treatments exhibit stronger dynamics in
feedback rules than the “5-point, rate transaction” and the “3-point, rate quality” treatments.
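Smoothed state probabilities of this kind are typically obtained with a forward-backward pass. The sketch below computes them for a single subject with two latent rules; averaging the result across subjects would give a population-level curve. The inputs are hypothetical and the function is an illustration of the general technique, not the authors’ implementation.

```python
def smoothed_state_probabilities(start, transitions, emissions):
    """Return P(S_t = s | all ratings) for each period t and each state s in (Q, P).

    transitions[t - 1] is the 2x2 matrix moving from index t - 1 to t;
    emissions[t][s] is the probability of the observed rating at index t under rule s.
    """
    n_periods = len(emissions)
    # Forward pass: filtered probabilities, normalised in every period.
    forward = []
    for t in range(n_periods):
        if t == 0:
            alpha = [start[s] * emissions[0][s] for s in range(2)]
        else:
            lam = transitions[t - 1]
            alpha = [sum(forward[-1][r] * lam[r][s] for r in range(2)) * emissions[t][s]
                     for s in range(2)]
        total = sum(alpha)
        forward.append([a / total for a in alpha])
    # Backward pass: information from later periods, also normalised.
    backward = [[1.0, 1.0] for _ in range(n_periods)]
    for t in range(n_periods - 2, -1, -1):
        lam = transitions[t]
        beta = [sum(lam[s][r] * emissions[t + 1][r] * backward[t + 1][r] for r in range(2))
                for s in range(2)]
        total = sum(beta)
        backward[t] = [b / total for b in beta]
    # Combine both passes and renormalise.
    smoothed = []
    for t in range(n_periods):
        joint = [forward[t][s] * backward[t][s] for s in range(2)]
        total = sum(joint)
        smoothed.append([j / total for j in joint])
    return smoothed

# Hypothetical two-period example (illustrative numbers only).
probs = smoothed_state_probabilities([0.5, 0.5],
                                     [[[0.9, 0.1], [0.4, 0.6]]],
                                     [[0.4, 0.2], [0.5, 0.3]])
print(probs)
```

Unlike filtered probabilities, the smoothed probabilities for early periods also use the ratings observed later in the game.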
In the “5-point, rate quality” treatment in particular, subjects seem to learn to use the quality rule over the course of
the game: the profit rule is much less sticky, at 92%, than the quality rule, which hovers around 99%.
Therefore, once subjects switch to the quality rule, they are far less likely to switch back to the profit rule.
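To give a sense of what these stickiness levels imply, the sketch below converts the baseline transition probabilities for the “5-point, rate quality” treatment in Table 3 into expected spell lengths and a long-run share of periods under the quality rule. It treats the chain as time-homogeneous at the average feedback score, which is a simplifying assumption: the estimated model is non-homogeneous and the game lasts only 60 periods.

```python
def expected_duration(stay_probability):
    """Expected number of consecutive periods spent in a state (geometric spell length)."""
    return 1.0 / (1.0 - stay_probability)

def stationary_quality_share(p_to_q, q_to_p):
    """Long-run share of periods under the quality rule in a homogeneous 2-state chain."""
    return p_to_q / (p_to_q + q_to_p)

# Baseline (average feedback score) values for "5-point, rate quality" from Table 3.
stay_q, stay_p = 0.9983, 0.9191
p_to_q, q_to_p = 0.0809, 0.0017

print(expected_duration(stay_p))                 # expected spell under the profit rule
print(expected_duration(stay_q))                 # expected spell under the quality rule
print(stationary_quality_share(p_to_q, q_to_p))  # long-run share of the quality rule
```

Under this approximation, a spell under the profit rule lasts roughly a dozen periods on average, whereas a spell under the quality rule would outlast the 60-period game many times over.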
State-specific behavior. We report the parameter estimates of the emission probability models in Appendix
C. The parameter estimates are in line with our expectations and show that quality and profits have a
positive impact on the ratings given. The higher the quality received and the higher the profit, the more
likely we are to observe a higher rating vs. a rating of one (our baseline). Conditional on feedback rule, we
computed the probabilities of giving each feedback level using the multinomial logit model, and plot the
results in Figures 12 and 13.
Figure 12: Predicted rating probabilities at different levels of quality and feedback score, conditional on using the
quality rule, across all treatments.
Feedback analysis conditional on the quality rule. Subjects in both “5-star” treatments seem to follow very
similar strategies (blue and red lines very close at most levels of the feedback score and quality, in Figure
12). This is also supported by the fact that most of the 95% HDIs of the parameter estimates overlap (see
Appendix C). When comparing the use of the quality feedback rule across the “3-star” treatments,
surprisingly, in the “3-star, rate quality” treatment, ratings are more impacted by feedback scores than in
the “3-star, rate transaction” treatment.
Feedback score has a much smaller impact on ratings than quality received, and makes more of a difference
at low and intermediate levels of quality received. Interestingly, the feedback score coefficient is negative,
and it seems to accommodate a disappointment or punishment effect: when quality received is low, the
higher the feedback score the lower the feedback given; when quality received is high, the impact of a
seller’s feedback score on rating given is much smaller.
In the “5-star” treatments, when the feedback score was lower than the quality received (quality exceeds
expectations), the average likelihood of giving a rating one level below the quality received is 0.065 for
the directed feedback condition, and 0.067 for the general feedback condition. When the feedback score was
higher than the quality received (subjects are disappointed with the quality received), the average likelihood
of giving a rating one level below the quality received is 0.242 and 0.245 for directed and general
feedback elicitation, respectively. When the feedback score matches the quality received, the average
likelihood of giving a rating one level below the quality received is 0.109 and 0.111 for directed and
general feedback, respectively. The disappointment of not receiving the expected quality (as suggested by
the seller’s feedback score) seems to lead subjects to punish sellers by leaving lower ratings. If subjects
were Bayesian updaters, they would treat the feedback score as a prior and give sellers a higher
feedback rating than the quality received. We should therefore observe the opposite trend.
The same analysis for the “3-star” treatments requires an extra step: imposing an optimal mapping between
feedback rating and quality received, in order to infer which rating is lower than warranted by the quality
received. Nonetheless, Figure 12 appears to support the same trend for the “3-star, rate quality” treatment,
but not for the “3-star, rate transaction” treatment. In the “3-star, rate quality” treatment, when
subjects receive 40% or 60% quality (2 or 3), they seem to use the feedback score to guide their rating.
When quality received is 40% (2) and the feedback score is 3, subjects are much more likely to give a rating
of 1 (0.75), suggesting a punishment effect due to subjects’ disappointment at receiving lower than
expected quality. When the reverse is true, and quality received is 2 whereas the feedback score is 1, subjects
become very lenient and are 75% likely to leave a rating of 2 instead of a rating of 1. When the feedback score
matches quality received at 2, the likelihood is evenly split between ratings of 1 and 2. The feedback score does
not make a difference when quality received is 80%: regardless of the feedback score and of the treatment,
subjects are 75% to 80% likely to leave a rating of 3.
Figure 13: Predicted rating probabilities at different levels of profit and feedback score, conditional on using the profit
rule, across all treatments.
Feedback analysis conditional on the profit rule. Sellers’ feedback score seems to be more impactful when
players use the profit rule, and has a positive impact on the rating given. Most parameter estimates are
either significantly different from 0 (the 95% HDIs do not contain 0) or directionally higher than 0 (see
Appendix C). The higher the feedback score, the higher the likelihood of leaving a higher rating for the
same level of profit, particularly in the “Rate transaction” treatments.
Overall, results show that feedback rule dynamics are strongly impacted by initial state probabilities, and
to some degree by game outcomes altering state transition probabilities. Start probabilities are likely
impacted by the feedback system design: in the initial period, the design of the market is the only information
available to buyers. To be successful, feedback systems must be designed to induce traders to give accurate and
useful feedback. When the feedback scale and the feedback elicitation question are mapped on the same
dimension, as in our “5-point, rate quality” treatment, buyers are able to focus their feedback ratings on the
dimension that is most diagnostic of seller performance.
5 Discussion and future directions
In this paper, we aim to uncover principles for the design of an informative feedback system. A
feedback system is characterized by its information structure, including the message space, the feedback
aggregation function, and the mapping function used by buyers to rate sellers’ performance, and hence
depends on decisions made by the platform and by market participants. We implement an experimental
design where we vary two components of the feedback system: the feedback elicitation question and the
rating scale. The feedback elicitation question asks for directed (rate quality) vs. general feedback (rate
transaction). The rating scale accommodates designs typically used in practice, with a 5-point scale that in
practice implies a rating of product quality, and a 3-point scale that usually implies a negative, neutral or
positive experience of the interaction with the seller. Our two-by-two full factorial design is intended to mimic
four online market designs, in which the feedback elicitation question is or is not mapped onto the most
predictive dimension of seller performance.
We choose a lab experiment over other types of data to study feedback system design
because lab experiments offer tighter control than field experiments or observational data. One key feature
of our experimental market is compulsory feedback, intended to limit any selection bias introduced by
traders’ choice not to leave feedback following a transaction. This is crucial, as the literature has shown that
the decision to abstain from leaving feedback is not random but driven by transaction experience (Bolton
et al. 2018, Dellarocas and Wood 2008) or social concerns. We avoid any confounding effects on
information transmission by requiring all traders to leave feedback after each transaction. In our
experiment, traders cannot communicate with each other, reducing any concerns related to empathy or
social pressure. Additionally, subjects are fully informed about the distribution of seller types, reducing
buyers’ uncertainty about expected quality compared to an online platform. Although artificial in several
respects, our experiment retains the key features of market design relevant in practice. Buyers have private
valuations for the traded goods, can make profits or losses from a transaction, and feedback scores are
highly informative for buyers’ behavior.
Sound principles for the design of market feedback systems are principles that optimize the informativeness
of the system by focusing on the important forecast variables. Our results show that when the feedback
elicitation question and the feedback scale align on a common dimension deemed most relevant in a
particular marketplace, the feedback system is most informative, i.e. best able to forecast sellers’ performance.
This main conclusion is supported by an informativeness analysis, followed by a second analysis that
uncovers buyers’ feedback-giving rules and the drivers of rating behavior. We use two approaches to
analyze system informativeness as measured by relative entropy, which captures the improvement from using
feedback information to form beliefs instead of using prior beliefs about the seller types. The first is based
on the theoretically expected belief distribution given our model across the four market designs, using the
empirical mappings of quality into feedback score. The second approach estimates the belief distributions
directly from the experimental data available to the subjects, using multinomial logit regressions. Both
methods reveal that the most informative feedback system is the directed “5-point, rate quality” treatment,
which maps the feedback elicitation question and feedback rating scale onto the quality scale.
Our empirical analysis further reveals the dynamics in the underlying feedback giving rules used by buyers
throughout the transaction periods. Buyers transition between leaving feedback based on the quality
received and based on their profits from a transaction, and the transition propensities are influenced by
sellers’ feedback score. Interestingly, even after accounting for its long-term impact on the chosen feedback
rule, the feedback score seems to affect the ratings sellers receive in the short run. This is suboptimal, as buyers should
use only the quality received as a proxy for seller performance when leaving a rating after the transaction
has occurred (assuming a transaction occurred). Feedback scores should be used only for deciding on the
bid. Our analysis suggests that the feedback score captures buyers’ disappointment at receiving
low quality. Buyers seem to punish sellers who perform worse than the feedback score would suggest by
leaving a lower rating than expected given the quality shipped by the seller.
Our most informative system, i.e. the system leading to largest relative entropy, is the “5-point, rate quality”
treatment. This supports our suggestion to design a feedback system with a clear mapping between the
feedback rating scale and the feedback elicitation question, linked to the dimension deemed most
informative for the feedback system, which in our case is quality. A fluent design allows people to easily,
almost automatically, leave feedback on the desired dimension. The system with the most fluent design leads
to more accurate feedback. The results are supported by psychological studies which highlight that fluency
eases the processing of information (Alter and Oppenheimer 2009) and improves users’
experience with an online system.
Online platforms can readily implement the conclusions of our analysis by identifying the dimension most
diagnostic of seller performance, and designing a fluent feedback system that predicts sellers’ performance
on that dimension. As future buyers are aware of the feedback system, further studies could test whether
system alignment also leads them to interpret sellers’ feedback scores on the desired dimension a priori,
even before the first transaction. Our experimental design offers a high degree of flexibility, and would
allow other researchers to test various feedback systems under specific setups.
Our analysis involves computerized sellers. As a next step, we plan to integrate human sellers who can
strategically decide the quality shipped in order to maximize profits. Such an extension of our experimental
design to a moral hazard setting would allow us to analyze whether the feedback system design features
found to be optimal in the current adverse selection setup would also be consequential in a moral hazard
setting. Otherwise, there may be a need to provide incentives for trustworthiness due to sellers’ strategic
behavior, and to adapt the feedback system to account for such behavior.
References
Alter, A.L., Oppenheimer, D.M. (2009). Uniting the Tribes of Fluency to Form a Metacognitive Nation.
Personality and Social Psychology Review 13(3), 219-235.
Ansari, A., Montoya, R., Netzer, O. (2012). Dynamic Learning in Behavioral Games: A Hidden Markov
Mixture of Experts Approach. Quantitative Marketing and Economics 10 (4), 475-503.
Bajari, P., Hortaçsu, A. (2004). Economic insights from internet auctions. Journal of Economic Literature,
42(2), 457–486.
Bauerly, R.J. (2009). Online auction fraud and eBay. Marketing Management Journal 19, 133-143.
Becker, G.M, DeGroot, M.H., Marschak, J. (1964). Measuring Utility by a Single-Response Sequential
Method. Behavioral Science, 9(3), 226-232.
Bolton, G.E., Greiner, B., Ockenfels, A., (2013). Engineering Trust - Reciprocity in the Production of
Reputation Information. Management Science 59, 265–285.
Bolton, G.E., Kusterer, D.J., Mans, J., (forthcoming). Inflated reputations: Uncertainty, Leniency and
Moral Wiggle Room in Trader Feedback Systems. Management Science.
Cokely, E. T., Galesic, M., Schulz, E., Ghazal, S., Garcia-Retamero, R. (2012). Measuring risk literacy:
The Berlin Numeracy Test. Judgment and Decision Making, 7(1), 25–47.
Cover, T. M., Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience, Second edition.
Crosetto, P., Filippin, A. (2013). The “bomb” risk elicitation task. Journal of Risk and Uncertainty, 47(1),
31–65.
Dellarocas, C., Wood, C.A., (2008). The Sound of Silence in Online Feedback: Estimating Trading Risks
in the Presence of Reporting Bias. Management Science 54, 460–476.
Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D. (2014). Bayesian Data Analysis, CRC
Press, 3rd edition.
Golman, R., Bhatia, S. (2012). Performance evaluation inflation and compression. Accounting,
Organizations and Society, 37(8), 534–543.
Gregg, D. G., Scott, J. E. (2006). The role of reputation systems in reducing on-line auction fraud.
International Journal of Electronic Commerce, 10(3), 95–120.
Gutt, D., Neumann, J., Zimmermann, S., Kundisch, D., Chen, J. (2018). Design of review systems – a
strategic instrument to shape online review behavior and economic outcomes. Working paper.
Kauffman, R. J., Wood, C. A. (2006). Doing their bidding: An empirical examination of factors that affect
a buyer’s utility in internet auctions. Information Technology and Management, 7(3), 171–190.
Nosko, C., Tadelis, S. (2015). The limits of reputation in platform markets: An empirical analysis and field
experiment. NBER Working Paper No. 20830.
Payne, J. W., Bettman, J. R., & Johnson, E. J. (1988). Adaptive strategy selection in decision
making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3), 534-552.
Resnick, P., Zeckhauser, R., Swanson, J., Lockwood, K. (2006). The value of reputation on eBay: A
controlled experiment. Experimental Economics, 9(2), 79–101.
Roth, A.E., (2002). The Economist as Engineer: Game Theory, Experimentation, and Computation as Tools
for Design Economics. Econometrica 70, 1341–1378.
Tadelis, S., (2016). Reputation and Feedback Systems in Online Platform Markets. Annual Review of
Economics 8, 321–340.
Salmon, T. (2004). Evidence for learning to learn behavior in normal form games. Theory and Decision,
56(4), 367–404.
SoPHIE labs: https://www.sophielabs.com
Stahl, D. (2001). Population rule learning in symmetric normal-form games: theory and evidence. Journal
of Economic Behavior and Organization, 1304, 1–17.
Veldkamp, L. L. (2011). Information choice in macroeconomics and finance. Princeton University Press.
Appendices
Appendix A. Details of the experiment
1.1. Screenshots of the simulation used to demonstrate that it is optimal for bidders to bid their valuation.
Appendix B: Simulation study
We report a simulation study to assess the empirical identification of the parameters of the non-homogeneous
HMME (NH-HMME) model. To keep the simulated data as close as possible to the empirical application, we use
the independent variables from our empirical dataset for the “3-star, rate transaction” treatment. We
simulate 54 individuals who participate in auctions for 30 periods. The levels of quality received, buyer
valuations for the traded goods, buyers’ profits and sellers’ feedback scores are the same as in the empirical
application. We simulate behavior using the proposed NH-HMME model, with model parameters set close to the
values estimated in the empirical application (see Tables B1.1 and B1.2 on parameter recovery).
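The data-generating step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the study: we assume a baseline-category logit for the rating with the lowest rating as reference, a logistic probability of keeping the current rule that depends on the feedback score, and randomly drawn covariates (the study itself reuses the experimental covariates). The coefficient values are the true values listed in Tables B1.1 and B1.2; all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
N_BUYERS, N_PERIODS = 54, 30  # as in the simulation study

# Emission coefficients per rule. Rows: the two non-baseline ratings (assumed
# to be ratings 2 and 3, with rating 1 as baseline). Columns: intercept,
# rule covariate (quality or profit), feedback score.
BETA = {
    0: np.array([[-5.0, 2.0, -0.1], [-15.0, 5.0, -0.2]]),  # quality rule
    1: np.array([[0.0, 0.1, -0.05], [-3.0, 0.15, 0.1]]),   # profit rule
}
RHO = np.array([2.0, 1.0])  # assumed: effect of feedback score on staying
TAU = np.array([0.5, 0.5])  # assumed: state-specific intercepts
PI0 = np.array([0.7, 0.3])  # initial probabilities: quality rule, profit rule

def rating_probs(state, x):
    """Baseline-category logit probabilities over ratings 1..3."""
    utilities = np.concatenate(([0.0], BETA[state] @ x))
    utilities -= utilities.max()  # numerical stability
    expu = np.exp(utilities)
    return expu / expu.sum()

def stay_prob(state, feedback_score):
    """Assumed logistic probability of keeping the current rule."""
    return 1.0 / (1.0 + np.exp(-(TAU[state] + RHO[state] * feedback_score)))

def simulate_buyer():
    """Simulate hidden rules and ratings for one buyer."""
    states, ratings = [], []
    state = int(rng.choice(2, p=PI0))
    for t in range(N_PERIODS):
        quality = rng.integers(1, 6)    # hypothetical quality level, 1..5
        profit = rng.normal(0.0, 20.0)  # hypothetical profit
        score = rng.uniform(0.0, 1.0)   # hypothetical feedback score
        if t > 0 and rng.random() > stay_prob(state, score):
            state = 1 - state           # switch rule
        covariate = quality if state == 0 else profit
        x = np.array([1.0, covariate, score])
        ratings.append(int(rng.choice(3, p=rating_probs(state, x))) + 1)
        states.append(state)
    return np.array(states), np.array(ratings)

data = [simulate_buyer() for _ in range(N_BUYERS)]
states = np.stack([s for s, _ in data])
ratings = np.stack([r for _, r in data])
print(states.shape, ratings.shape)
```

Keeping the hidden states alongside the simulated ratings is what later allows the decoded states to be compared with the truth.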
                                             Highest density interval    Effective
Parameter                True  Estimated mean        2.5%       97.5%  sample size       R̂
Quality rule
β_{1,intercept}            -5          -2.905      -4.078      -1.862      685.149   0.999
β_{1,quality}               2           1.802       1.439       2.215      750.817   0.999
β_{1,feedback score}     -0.1          -0.915      -1.578      -0.358      584.268   1.002
β_{2,intercept}           -15          -9.869     -11.674      -8.228      776.121   0.999
β_{2,quality}               5           4.259       3.622       4.948      607.354   0.999
β_{2,feedback score}     -0.2          -1.539      -2.351      -0.756      651.036   1.005
Profit rule
β_{1,intercept}             0          -1.460      -2.535      -0.411      723.977   1.004
β_{1,profit}              0.1           0.088       0.056       0.122     1000.000   0.999
β_{1,feedback score}    -0.05           0.687       0.160       1.233      698.243   1.004
β_{2,intercept}            -3          -4.356      -6.364      -2.410      732.033   1.009
β_{2,profit}             0.15           0.138       0.076       0.198     1000.000   0.999
β_{2,feedback score}      0.1           0.714      -0.336       1.696      694.871   1.005
Table B1.1: Parameter recovery, with data generated from the NH-HMME model, for the emission model
                                             Highest density interval    Effective
Parameter                True  Estimated mean        2.5%       97.5%  sample size       R̂
ρ_1                         2           2.098       0.155       4.176      804.819   1.000
ρ_2                         1           1.388      -1.173       4.147      791.055   1.000
τ_1                       0.5           0.423      -0.582       1.435      773.110   0.999
τ_2                       0.5           0.378      -1.050       1.804      772.107   0.999
π_1                       0.7           0.631       0.433       0.823     1000.000   1.001
π_2                       0.3           0.369       0.177       0.567     1000.000   1.001
Log-posterior                        -1187.82    -1195.06    -1183.04      301.454   0.999
Table B1.2: Parameter recovery, with data generated from the NH-HMME model, for the transition model
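Recovery is good for the transition model and mixed for the emission model. All 95% highest density
intervals in Table B1.2 contain the true parameter values. In Table B1.1, the intervals contain the true
values for 5 of the 12 emission parameters; the remaining 7, mostly intercepts and feedback-score
coefficients, fall outside their intervals, although the estimated quality and profit coefficients retain
the sign and ordering of the true values.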
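The interval check can be written out explicitly. The sketch below applies it to the transition-model rows of Table B1.2; the numbers are copied from that table, while the labels (`rho_1`, `tau_1`, and so on) simply follow the row order of the table and the helper name `covered` is hypothetical.

```python
# (label, true value, 2.5% bound, 97.5% bound), copied from Table B1.2.
rows = [
    ("rho_1", 2.0, 0.155, 4.176),
    ("rho_2", 1.0, -1.173, 4.147),
    ("tau_1", 0.5, -0.582, 1.435),
    ("tau_2", 0.5, -1.050, 1.804),
    ("pi_1", 0.7, 0.433, 0.823),
    ("pi_2", 0.3, 0.177, 0.567),
]

def covered(true_value, lower, upper):
    """True if the interval [lower, upper] contains the true value."""
    return lower <= true_value <= upper

coverage = {label: covered(true, lo, hi) for label, true, lo, hi in rows}
n_covered = sum(coverage.values())
print(f"{n_covered} of {len(rows)} intervals contain the true value")
# prints "6 of 6 intervals contain the true value"
```

The same check can be applied row by row to the emission-model table.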
We also explored the identification of the hidden feedback rules. Figure B1 plots the true probabilities of
using the quality rule against the filtered probabilities, together with the proportion of quality-rule
states recovered through Viterbi decoding. The three quantities are very similar. The overall hidden state
recovery rate is 85.62%.
Figure B1: True vs. filtered probabilities of using the quality rule, and quality state proportions
recovered using Viterbi decoding.
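For readers unfamiliar with the two decoding quantities, the following sketch shows forward filtering and Viterbi decoding for a two-state hidden Markov model with time-varying transition matrices, and computes a state recovery rate on simulated data. It is a generic textbook illustration, not the estimation code behind Figure B1: the transition and rating probabilities are hypothetical and are treated as known rather than estimated.

```python
import numpy as np

def forward_filter(pi0, trans, emis):
    """Filtered probabilities P(s_t | y_1..y_t).

    pi0:   (S,) initial state probabilities
    trans: (T, S, S), trans[t, i, j] = P(s_t = j | s_{t-1} = i); trans[0] unused
    emis:  (T, S), likelihood of the observation at t under each state
    """
    T, S = emis.shape
    filtered = np.empty((T, S))
    alpha = pi0 * emis[0]
    filtered[0] = alpha / alpha.sum()
    for t in range(1, T):
        alpha = (filtered[t - 1] @ trans[t]) * emis[t]
        filtered[t] = alpha / alpha.sum()
    return filtered

def viterbi(pi0, trans, emis):
    """Most likely hidden state sequence, computed in log space."""
    T, S = emis.shape
    log_delta = np.log(pi0) + np.log(emis[0])
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(trans[t])  # rows: from, cols: to
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(emis[t])
    path = np.empty(T, dtype=int)
    path[-1] = log_delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path

rng = np.random.default_rng(0)
T = 30
pi0 = np.array([0.7, 0.3])  # state 0: quality rule, state 1: profit rule
# Hypothetical time-varying persistence, standing in for a covariate effect.
stay = 0.6 + 0.3 * rng.uniform(size=T)
trans = np.stack([np.array([[p, 1 - p], [1 - p, p]]) for p in stay])
# Hypothetical distribution over three ratings under each rule.
rating_given_state = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])

# Simulate hidden states and observed ratings.
true_states = np.empty(T, dtype=int)
true_states[0] = rng.choice(2, p=pi0)
for t in range(1, T):
    true_states[t] = rng.choice(2, p=trans[t, true_states[t - 1]])
obs = np.array([rng.choice(3, p=rating_given_state[s]) for s in true_states])

emis = rating_given_state[:, obs].T  # (T, S) likelihoods of each observation
filtered = forward_filter(pi0, trans, emis)
decoded = viterbi(pi0, trans, emis)
recovery_rate = (decoded == true_states).mean()
print(f"hidden state recovery rate: {recovery_rate:.2%}")
```

Filtering uses only the observations up to each period, whereas Viterbi decoding uses the whole sequence to pick the single most likely path, so the two summaries need not coincide exactly.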
Appendix C: Results of the NH-HMME model
C1: Model parameter estimates and 95% highest density intervals for the “3-star, rate transaction”
treatment.
                                Highest density interval    Effective
Parameter                   Mean        2.5%       97.5%  sample size       R̂
Quality rule
β_{1,intercept}           -6.849      -8.090      -5.845      452.484   1.000
β_{1,quality}              3.197       2.846       3.594      372.926   1.004
β_{1,feedback score}      -0.078      -0.546       0.416      505.101   0.998
β_{2,intercept}          -19.274     -21.187     -17.675      491.895   1.005
β_{2,quality}              6.663       6.100       7.302      346.514   1.007
β_{2,feedback score}      -0.154      -0.909       0.529      478.314   0.999
Profit rule
β_{1,intercept}           -0.013      -1.135       1.149      378.613   0.999
β_{1,profit}               0.076       0.033       0.124      524.080   1.001
β_{1,feedback score}      -0.050      -0.723       0.544      420.336   0.998
β_{2,intercept}           -3.458      -5.085      -1.812      461.768   1.004
β_{2,profit}               0.133       0.069       0.196      494.422   0.999
β_{2,feedback score}       0.906       0.102       1.672      387.666   1.004
Table C1.1: Emission model estimates for the “3-star, rate transaction” treatment
                                Highest density interval    Effective
Parameter                   Mean        2.5%       97.5%  sample size       R̂
ρ_1                        2.228      -0.951       5.329      600.000   1.007
ρ_2                        2.343      -0.887       5.627      460.742   1.002
τ_1                        2.822       0.680       5.765      394.601   1.010
τ_2                        0.512      -1.190       2.312      441.438   1.002
π_1                        0.716       0.562       0.846      600.000   0.999
π_2                        0.284       0.154       0.438      600.000   0.999
Table C1.2: Transition model estimates for the “3-star, rate transaction” treatment
C2: Model parameter estimates and 95% highest density intervals for the “3-star, rate quality” treatment.
                                Highest density interval    Effective
Parameter                   Mean        2.5%       97.5%  sample size       R̂
Quality rule
β_{1,intercept}           -5.175      -6.095      -4.052      536.761   1.001
β_{1,quality}              3.570       3.164       3.992      360.233   1.002
β_{1,feedback score}      -1.031      -1.482      -0.545      444.684   1.001
β_{2,intercept}          -19.497     -21.339     -17.544      600.000   0.997
β_{2,quality}              7.672       6.991       8.286      342.561   0.999
β_{2,feedback score}      -1.378      -2.090      -0.662      515.993   1.005
Profit rule
β_{1,intercept}           -0.479      -2.248       1.207      376.352   1.005
β_{1,profit}               0.444       0.295       0.630      536.345   1.000
β_{1,feedback score}      -0.236      -1.135       0.627      383.615   1.000
β_{2,intercept}           -5.049      -7.286      -2.966      323.200   1.002
β_{2,profit}               0.721       0.544       0.956      495.816   1.001
β_{2,feedback score}       0.576      -0.603       1.624      365.510   1.004
Table C2.1: Emission model estimates for the “3-star, rate quality” treatment
                                Highest density interval    Effective
Parameter                   Mean        2.5%       97.5%  sample size       R̂
ρ_1                        1.238      -1.742       4.210      437.827   1.000
ρ_2                       -0.148      -2.950       3.029      477.508   1.003
τ_1                        2.290       0.711       4.158      429.653   0.998
τ_2                        1.889       0.223       3.352      396.496   1.003
π_1                        0.810       0.672       0.919      600.000   0.999
π_2                        0.190       0.081       0.328      600.000   0.999
Table C2.2: Transition model estimates for the “3-star, rate quality” treatment
C3: Model parameter estimates and 95% highest density intervals for the “5-star, rate transaction”
treatment.
                                Highest density interval    Effective
Parameter                   Mean        2.5%       97.5%  sample size       R̂
Quality rule
β_{1,intercept}           -1.652      -2.550      -0.838      463.348   0.999
β_{1,quality}              3.422       2.842       4.006      310.507   1.004
β_{1,feedback score}      -1.088      -1.506      -0.682      317.744   1.001
β_{2,intercept}           -6.856      -8.057      -5.667      530.018   1.002
β_{2,quality}              6.636       5.948       7.417      279.252   1.005
β_{2,feedback score}      -1.763      -2.247      -1.300      291.738   1.004
β_{3,intercept}          -16.348     -17.916     -14.775      558.056   1.000
β_{3,quality}              9.755       8.960      10.662      250.655   1.002
β_{3,feedback score}      -2.142      -2.731      -1.623      216.685   1.008
β_{4,intercept}          -29.979     -32.048     -27.835      600.000   0.999
β_{4,quality}             13.099      12.158      14.086      265.331   1.002
β_{4,feedback score}      -2.473      -3.184      -1.847      254.040   1.000
Profit rule
β_{1,intercept}            1.107      -0.109       2.290      366.422   0.997
β_{1,profit}               0.062       0.022       0.102      600.000   1.001
β_{1,feedback score}      -0.487      -0.881      -0.070      360.349   0.997
β_{2,intercept}           -0.941      -2.063       0.379      285.049   1.005
β_{2,profit}               0.143       0.102       0.184      494.772   1.002
β_{2,feedback score}       0.065      -0.359       0.410      260.350   1.006
β_{3,intercept}           -3.105      -4.379      -1.807      239.882   1.003
β_{3,profit}               0.228       0.177       0.276      459.346   1.012
β_{3,feedback score}       0.413       0.025       0.812      249.988   1.001
β_{4,intercept}           -4.693      -6.204      -3.108      214.325   1.010
β_{4,profit}               0.284       0.229       0.342      392.190   1.010
β_{4,feedback score}       0.589       0.154       1.030      223.695   1.008
Table C3.1: Emission model estimates for the “5-star, rate transaction” treatment
                                Highest density interval    Effective
Parameter                   Mean        2.5%       97.5%  sample size       R̂
ρ_1                        1.959      -1.343       5.427      600.000   1.001
ρ_2                        0.970      -2.557       4.646      461.907   1.000
τ_1                        1.091      -0.089       2.372      600.000   1.003
τ_2                        0.905      -0.372       2.277      464.814   1.000
π_1                        0.638       0.479       0.787      600.000   1.009
π_2                        0.362       0.213       0.521      600.000   1.009
Table C3.2: Transition model estimates for the “5-star, rate transaction” treatment
C4: Model parameter estimates and 95% highest density intervals for the “5-star, rate quality” treatment.
                                Highest density interval    Effective
Parameter                   Mean        2.5%       97.5%  sample size       R̂
Quality rule
β_{1,intercept}           -2.763      -3.522      -1.951      427.413   0.999
β_{1,quality}              3.546       3.078       4.004      346.070   1.000
β_{1,feedback score}      -0.883      -1.187      -0.581      298.463   1.004
β_{2,intercept}           -8.879      -9.994      -7.812      400.690   0.999
β_{2,quality}              7.215       6.603       7.926      271.993   1.005
β_{2,feedback score}      -1.691      -2.138      -1.222      265.299   1.010
β_{3,intercept}          -18.387     -19.846     -17.016      422.690   0.998
β_{3,quality}             10.065       9.377      10.771      232.075   1.004
β_{3,feedback score}      -1.760      -2.250      -1.247      258.804   1.006
β_{4,intercept}          -32.836     -34.717     -31.037      501.201   1.000
β_{4,quality}             13.733      12.976      14.521      224.273   1.009
β_{4,feedback score}      -2.244      -2.809      -1.678      273.720   1.006
Profit rule
β_{1,intercept}           -0.739      -2.412       1.135      508.889   0.999
β_{1,profit}               0.048      -0.020       0.114      600.000   0.999
β_{1,feedback score}      -0.171      -0.848       0.399      410.630   0.999
β_{2,intercept}           -2.306      -4.022      -0.817      600.000   1.004
β_{2,profit}               0.072       0.004       0.132      554.966   0.998
β_{2,feedback score}       0.410      -0.111       0.996      514.116   1.002
β_{3,intercept}           -2.670      -4.394      -0.842      522.003   1.000
β_{3,profit}               0.105       0.037       0.179      600.000   0.997
β_{3,feedback score}       0.290      -0.354       0.891      540.582   0.999
β_{4,intercept}           -3.040      -4.897      -1.167      600.000   0.998
β_{4,profit}               0.150       0.071       0.242      600.000   0.997
β_{4,feedback score}       0.207      -0.392       0.773      600.000   0.998
Table C4.1: Emission model estimates for the “5-star, rate quality” treatment
                                Highest density interval    Effective
Parameter                   Mean        2.5%       97.5%  sample size       R̂
ρ_1                        1.445      -1.532       4.273      411.812   1.007
ρ_2                        1.900      -1.252       4.950      600.000   1.003
τ_1                        1.640       0.532       3.025      459.120   1.005
τ_2                        0.177      -0.840       1.419      441.828   1.003
π_1                        0.806       0.671       0.919      600.000   1.001
π_2                        0.194       0.081       0.329      600.000   1.001
Table C4.2: Transition model estimates for the “5-star, rate quality” treatment