
    An Empirical Evaluation of Thompson Sampling

Olivier Chapelle, Yahoo! Research, Santa Clara, CA

    [email protected]

Lihong Li, Yahoo! Research, Santa Clara, CA

    [email protected]

    Abstract

Thompson sampling is one of the oldest heuristics for addressing the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. Since this heuristic is also very easy to implement, we argue that it should be part of the standard set of baselines to compare against.

    1 Introduction

Various algorithms have been proposed to solve exploration / exploitation or bandit problems. One of the most popular is Upper Confidence Bound or UCB [7, 3], for which strong theoretical guarantees on the regret can be proved. Another representative is the Bayes-optimal approach of Gittins [4] that directly maximizes expected cumulative payoffs with respect to a given prior distribution. A lesser-known family of algorithms is so-called probability matching. The idea of this heuristic is old and dates back to [16], which is the reason why this scheme is also referred to as Thompson sampling.

The idea of Thompson sampling is to randomly draw each arm according to its probability of being optimal. In contrast to a full Bayesian method like the Gittins index, one can often implement Thompson sampling efficiently. Recent results using Thompson sampling seem promising [5, 6, 14, 12]. The reason why it is not very popular might be its lack of theoretical analysis. Only two papers have tried to provide such an analysis, and they were only able to prove asymptotic convergence [6, 11].

In this work, we present some empirical results, first on a simulated problem and then on two real-world ones: display advertisement selection and news article recommendation. In all cases, despite its simplicity, Thompson sampling achieves state-of-the-art results, and in some cases significantly outperforms other alternatives like UCB. These findings suggest that Thompson sampling should be included among the standard baselines to compare against, and that finite-time regret bounds should be developed for this empirically successful algorithm.

    2 Algorithm

The contextual bandit setting is as follows. At each round we have a context x (optional) and a set of actions A. After choosing an action a ∈ A, we observe a reward r. The goal is to find a policy that selects actions such that the cumulative reward is as large as possible.

Thompson sampling is best understood in a Bayesian setting as follows. The set of past observations D is made of triplets (x_i, a_i, r_i) and is modeled using a parametric likelihood function P(r | a, x, θ) depending on some parameters θ. Given a prior distribution P(θ) on these parameters, the posterior distribution of these parameters is given by Bayes' rule,

\[
P(\theta \mid D) \propto \prod_i P(r_i \mid a_i, x_i, \theta)\, P(\theta).
\]


In the realizable case, the reward is a stochastic function of the action, context and the unknown, true parameter θ*. Ideally, we would like to choose the action maximizing the expected reward, max_a E(r | a, x, θ*).

Of course, θ* is unknown. If we are just interested in maximizing the immediate reward (exploitation), then one should choose the action that maximizes E(r | a, x) = ∫ E(r | a, x, θ) P(θ | D) dθ. But in an exploration / exploitation setting, the probability matching heuristic consists in randomly selecting an action a according to its probability of being optimal. That is, action a is chosen with probability

\[
\int I\Big[ E(r \mid a, x, \theta) = \max_{a'} E(r \mid a', x, \theta) \Big]\, P(\theta \mid D)\, d\theta,
\]

where I is the indicator function. Note that the integral does not have to be computed explicitly: it suffices to draw a random parameter θ at each round, as explained in Algorithm 1. Implementation of the algorithm is thus efficient and straightforward in most applications.

    Algorithm 1 Thompson sampling

  D = ∅
  for t = 1, . . . , T do
    Receive context x_t
    Draw θ^t according to P(θ | D)
    Select a_t = arg max_a E_r(r | x_t, a, θ^t)
    Observe reward r_t
    D = D ∪ {(x_t, a_t, r_t)}
  end for

In the standard K-armed Bernoulli bandit, each action corresponds to the choice of an arm. The reward of the i-th arm follows a Bernoulli distribution with mean θ_i. It is standard to model the mean reward of each arm using a Beta distribution since it is the conjugate distribution of the binomial distribution. The instantiation of Thompson sampling for the Bernoulli bandit is given in Algorithm 2. It is straightforward to adapt the algorithm to the case where different arms use different Beta distributions as their priors.

Algorithm 2 Thompson sampling for the Bernoulli bandit
Require: α, β, prior parameters of a Beta distribution
  S_i = 0, F_i = 0, ∀i. {Success and failure counters}
  for t = 1, . . . , T do
    for i = 1, . . . , K do
      Draw θ_i according to Beta(S_i + α, F_i + β).
    end for
    Draw arm î = arg max_i θ_i and observe reward r
    if r = 1 then
      S_î = S_î + 1
    else
      F_î = F_î + 1
    end if
  end for
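To make Algorithm 2 concrete, here is a minimal Python sketch on a simulated Bernoulli environment (the function name, the use of numpy, and the simulated feedback loop are our own additions, not the authors' code):

```python
import numpy as np

def thompson_bernoulli(true_probs, T, alpha=1.0, beta=1.0, seed=None):
    """Run Algorithm 2 on a simulated K-armed Bernoulli bandit."""
    rng = np.random.default_rng(seed)
    K = len(true_probs)
    S = np.zeros(K)  # success counters
    F = np.zeros(K)  # failure counters
    regret = 0.0
    for _ in range(T):
        theta = rng.beta(S + alpha, F + beta)           # one posterior sample per arm
        arm = int(np.argmax(theta))
        reward = float(rng.random() < true_probs[arm])  # simulated Bernoulli feedback
        S[arm] += reward
        F[arm] += 1.0 - reward
        regret += max(true_probs) - true_probs[arm]
    return regret, S, F
```

For instance, thompson_bernoulli([0.5] + [0.48] * 9, 10**5) reproduces, in miniature, the K = 10, ε = 0.02 setting used in the simulations below.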

    3 Simulations

We present some simulation results with Thompson sampling for the Bernoulli bandit problem and compare them to the UCB algorithm. The reward probability of each of the K arms is modeled by a Beta distribution which is updated after an arm is selected (see Algorithm 2). The initial prior distribution is Beta(1,1).

There are various variants of the UCB algorithm, but they all have in common that the confidence parameter should increase over time. Specifically, we chose the arm for which the following upper confidence bound [8, page 278] is maximum:


\[
\frac{k}{m} + \sqrt{\frac{2\,\frac{k}{m}\,\log\frac{1}{\delta}}{m}} + \frac{2\log\frac{1}{\delta}}{m}, \qquad \delta = \frac{1}{t}, \tag{1}
\]

where m is the number of times the arm has been selected and k its total reward. This is a tight upper confidence bound derived from Chernoff's bound.
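For reference, a small Python helper computing the index (1); the handling of unplayed arms (returning +inf so that every arm is pulled at least once) is our assumption, not something specified in the text:

```python
import math

def ucb_index(k, m, t):
    """Upper confidence bound (1) for an arm with total reward k pulled m times at round t."""
    if m == 0:
        return float("inf")  # assumption: force at least one pull of every arm
    mean = k / m
    log_term = math.log(t)   # log(1/delta) with delta = 1/t
    return mean + math.sqrt(2.0 * mean * log_term / m) + 2.0 * log_term / m
```

At each round the arm maximizing this index is played.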

In this simulation, the best arm has a reward probability of 0.5 and the K − 1 other arms have a probability of 0.5 − ε. In order to speed up the computations, the parameters are only updated after every 100 iterations. The regret as a function of T for various settings is plotted in figure 1. An asymptotic lower bound has been established in [7] for the regret of a bandit algorithm:

\[
R(T) \geq \log(T)\left[\sum_{i=1}^{K} \frac{p^* - p_i}{D(p_i \,\|\, p^*)} + o(1)\right], \tag{2}
\]

where p_i is the reward probability of the i-th arm, p^* = max_i p_i and D is the Kullback-Leibler divergence. This lower bound is logarithmic in T with a constant depending on the p_i values. The plots in figure 1 show that the regrets are indeed logarithmic in T (the linear trend on the right-hand side) and it turns out that the observed constants (slopes of the lines) are close to the optimal constants given by the lower bound (2). Note that the offset of the red curves is irrelevant because of the o(1) term in the lower bound (2). In fact, the red curves were shifted such that they pass through the lower left-hand corner of the plot.
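The constant in front of log(T) in (2) is easy to compute for Bernoulli arms; a small sketch (the function names are ours):

```python
import math

def bernoulli_kl(p, q):
    """Kullback-Leibler divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def lower_bound_slope(probs):
    """Constant multiplying log(T) in the lower bound (2), summing over suboptimal arms."""
    p_star = max(probs)
    return sum((p_star - p) / bernoulli_kl(p, p_star) for p in probs if p < p_star)

# Example: the K = 10, epsilon = 0.02 setting (best arm at 0.5, nine arms at 0.48).
print(lower_bound_slope([0.5] + [0.48] * 9))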

[Figure 1 omitted: four panels of cumulative regret versus T (log scale), one per setting (K=10, ε=0.1; K=100, ε=0.1; K=10, ε=0.02; K=100, ε=0.02), each showing Thompson sampling, UCB, and the asymptotic lower bound.]

Figure 1: Cumulative regret for K ∈ {10, 100} and ε ∈ {0.02, 0.1}. The plots are averaged over 100 repetitions. The red line is the lower bound (2), shifted such that it goes through the origin.

As with any Bayesian algorithm, one can wonder about the robustness of Thompson sampling to prior mismatch. The results in figure 1 already include some prior mismatch, because the Beta prior with parameters (1,1) has a large variance while the true probabilities were selected to be close to 0.5.


[Figure 2 omitted: cumulative regret versus T (log scale) for standard and optimistic Thompson sampling.]

Figure 2: Regret of optimistic Thompson sampling [11] in the same setting as the lower left plot of figure 1.

We have also done some other simulations (not shown) where there is a mismatch in the prior mean. In particular, when the reward probability of the best arm is 0.1 and the 9 others have a probability of 0.08, Thompson sampling, with the same prior as before, is still better than UCB and is still asymptotically optimal.

We can thus conclude that in these simulations, Thompson sampling is asymptotically optimal and achieves a smaller regret than the popular UCB algorithm. It is important to note that for UCB, the confidence bound (1) is tight; we have tried some other confidence bounds, including the one originally proposed in [3], but they resulted in larger regrets.

Optimistic Thompson sampling  The intuition behind UCB and Thompson sampling is that, for the purpose of exploration, it is beneficial to boost the predictions of actions for which we are uncertain. But Thompson sampling modifies the predictions in both directions and there is apparently no benefit in decreasing a prediction. This observation led to a recently proposed algorithm called Optimistic Bayesian sampling [11] in which the modified score is never smaller than the mean. More precisely, in Algorithm 1, E_r(r | x_t, a, θ^t) is replaced by max(E_r(r | x_t, a, θ^t), E_{r,θ|D}(r | x_t, a, θ)).

Simulations in [12] showed some gains using this optimistic version of Thompson sampling. We compared in figure 2 the two versions of Thompson sampling in the case K = 10 and ε = 0.02. Optimistic Thompson sampling achieves a slightly better regret, but the gain is marginal. A possible explanation is that when the number of arms is large, it is likely that, in standard Thompson sampling, the selected arm already has a boosted score.
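In the Beta-Bernoulli case, the change amounts to clipping each sampled score at its posterior mean; a minimal sketch reusing the counters of Algorithm 2 (numpy-based, our own naming):

```python
import numpy as np

def optimistic_scores(S, F, alpha=1.0, beta=1.0, rng=None):
    """Optimistic Thompson sampling scores: samples clipped from below at the posterior mean."""
    rng = rng or np.random.default_rng()
    a, b = S + alpha, F + beta
    samples = rng.beta(a, b)           # standard Thompson samples
    means = a / (a + b)                # posterior means
    return np.maximum(samples, means)  # randomization can only boost a score
```

The arm played is then the argmax of these scores, exactly as in Algorithm 2.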

Posterior reshaping  Thompson sampling is a heuristic advocating to draw samples from the posterior, but one might consider changing that heuristic to draw samples from a modified distribution. In particular, sharpening the posterior would have the effect of increasing exploitation, while widening it would favor exploration. In our simulations, the posterior is a Beta distribution with parameters a and b, and we have tried to change it to parameters a/α, b/α. Doing so does not change the posterior mean, but multiplies its variance by a factor close to α.
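A sketch of this reshaping for the Beta posterior (the parameter name alpha_reshape is ours; values below 1 sharpen the posterior):

```python
import numpy as np

def reshaped_samples(S, F, alpha=1.0, beta=1.0, alpha_reshape=0.5, rng=None):
    """Sample arm scores from the reshaped posterior Beta((S+alpha)/r, (F+beta)/r), r = alpha_reshape.

    The posterior mean is unchanged while the variance is scaled by roughly
    alpha_reshape: values < 1 favor exploitation, values > 1 favor exploration.
    """
    rng = rng or np.random.default_rng()
    a = (S + alpha) / alpha_reshape
    b = (F + beta) / alpha_reshape
    return rng.beta(a, b)
```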

Figure 3 shows the average and distribution of regret for different values of α. Values of α smaller than 1 decrease the amount of exploration and often result in lower regret. But the price to pay is a higher variance: in some runs, the regret is very large. The average regret is asymptotically not as good as with α = 1, but tends to be better in the non-asymptotic regime.

Impact of delay  In a real-world system, the feedback is typically not processed immediately because of various runtime constraints. Instead it usually arrives in batches over a certain period of time. We now try to quantify the impact of this delay by doing some simulations that mimic the problem of news article recommendation [9] that will be described in section 5.


[Figure 3 omitted: left, average regret versus T (log scale) for α ∈ {2, 1, 0.5, 0.25} and the asymptotic lower bound; right, the distribution of the regret at T = 10^7 for each α.]

Figure 3: Thompson sampling where the parameters of the Beta posterior distribution have been divided by α. The setting is the same as in the lower left plot of figure 1 (1000 repetitions). Left: average regret as a function of T. Right: distribution of the regret at T = 10^7. Since the outliers can take extreme values, those above 6000 are compressed at the top of the figure.

Table 1: Influence of the delay: regret when the feedback is provided only every δ steps.

δ       1       3       10      32      100     316     1000
UCB     24,145  24,695  25,662  28,148  37,141  77,687  226,220
TS      9,105   9,199   9,049   9,451   11,550  21,594  59,256
Ratio   2.65    2.68    2.84    2.98    3.22    3.60    3.82

We consider a dynamic set of 10 items. At a given time, with probability 10^{-3} one of the items retires and is replaced by a new one. The true reward probability of a given item is drawn according to a Beta(4,4) distribution. The feedback is received only every δ time units. Table 1 shows the average regret (over 100 repetitions) of Thompson sampling and UCB at T = 10^6. An interesting quantity in this simulation is the relative regret of UCB and Thompson sampling. It appears that Thompson sampling is more robust than UCB when the delay is long. Thompson sampling alleviates the influence of delayed feedback by randomizing over actions; on the other hand, UCB is deterministic and suffers a larger regret in case of a sub-optimal choice.
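A minimal sketch of this delayed-feedback simulation for Thompson sampling; the exact retirement and counter-reset mechanics are our assumptions, as the text only describes the setup at the level of the paragraph above:

```python
import numpy as np

def delayed_ts_regret(T=10**6, n_items=10, delay=100, retire_prob=1e-3, seed=None):
    """Thompson sampling on a dynamic item pool with feedback batched every `delay` steps."""
    rng = np.random.default_rng(seed)
    probs = rng.beta(4, 4, size=n_items)     # true reward probabilities
    S = np.zeros(n_items)                    # committed success counts
    F = np.zeros(n_items)                    # committed failure counts
    pending = []                             # (item, reward) pairs awaiting the next batch
    regret = 0.0
    for t in range(1, T + 1):
        if rng.random() < retire_prob:       # an item retires and is replaced by a new one
            i = int(rng.integers(n_items))
            probs[i] = rng.beta(4, 4)
            S[i] = F[i] = 0.0
            pending = [(j, r) for (j, r) in pending if j != i]
        theta = rng.beta(S + 1.0, F + 1.0)   # Beta(1,1) prior, as in section 3
        item = int(np.argmax(theta))
        reward = float(rng.random() < probs[item])
        pending.append((item, reward))
        regret += probs.max() - probs[item]
        if t % delay == 0:                   # delayed batch of feedback arrives
            for j, r in pending:
                S[j] += r
                F[j] += 1.0 - r
            pending = []
    return regret
```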

    4 Display Advertising

We now consider an online advertising application. Given a user visiting a publisher page, the problem is to select the best advertisement for that user. A key element in this matching problem is click-through rate (CTR) estimation: what is the probability that a given ad will be clicked given some context (user, page visited)? Indeed, in a cost-per-click (CPC) campaign, the advertiser only pays when his ad gets clicked. This is the reason why it is important to select ads with high CTRs. There is of course a fundamental exploration / exploitation dilemma here: in order to learn the CTR of an ad, it needs to be displayed, leading to a potential loss of short-term revenue. More details on display advertising and the data used for modeling can be found in [1].

In this paper, we consider standard regularized logistic regression for predicting CTR. There are several features representing the user, page, ad, as well as conjunctions of these features. Some of the features include identifiers of the ad, advertiser, publisher and visited page. These features are hashed [17] and each training sample ends up being represented as a sparse binary vector of dimension 2^24.

In our model, the posterior distribution on the weights is approximated by a Gaussian distribution with diagonal covariance matrix. As in the Laplace approximation, the mean of this distribution is the mode of the posterior and the inverse variance of each weight is given by the curvature. The use of this convenient approximation of the posterior is twofold. It first serves as a prior on the weights to update the model when a new batch of training data becomes available, as described in Algorithm 3. And it is also the distribution used in Thompson sampling.

    Algorithm 3 Regularized logistic regression with batch updates

Require: Regularization parameter λ > 0.
  m_i = 0, q_i = λ. {Each weight w_i has an independent prior N(m_i, q_i^{-1})}
  for t = 1, . . . , T do
    Get a new batch of training data (x_j, y_j), j = 1, . . . , n.
    Find w as the minimizer of:
      (1/2) Σ_{i=1}^{d} q_i (w_i − m_i)² + Σ_{j=1}^{n} log(1 + exp(−y_j wᵀx_j)).
    m_i = w_i
    q_i = q_i + Σ_{j=1}^{n} x_{ij}² p_j (1 − p_j),   p_j = (1 + exp(−wᵀx_j))^{-1}   {Laplace approximation}
  end for
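A compact Python sketch of one batch update of Algorithm 3, assuming numpy and scipy and labels y_j in {−1, +1}; this is our reading of the update, not released code:

```python
import numpy as np
from scipy.optimize import minimize

def batch_update(m, q, X, y):
    """One iteration of Algorithm 3: MAP fit under the Gaussian prior N(m, 1/q),
    then a diagonal Laplace update of the precisions q."""
    def objective(w):
        margins = y * (X @ w)
        return 0.5 * np.sum(q * (w - m) ** 2) + np.sum(np.logaddexp(0.0, -margins))

    def gradient(w):
        margins = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(margins))        # sigmoid(-y_j w.x_j)
        return q * (w - m) - X.T @ (y * sigma)

    w = minimize(objective, x0=m, jac=gradient, method="L-BFGS-B").x
    p = 1.0 / (1.0 + np.exp(-(X @ w)))               # predicted click probabilities
    q_new = q + (X ** 2).T @ (p * (1.0 - p))         # curvature of the log-loss (Laplace)
    return w, q_new
```

The returned (w, q_new) become the prior (m_i, q_i) for the next batch, and the Gaussian N(m_i, q_i^{-1}) is also what Thompson sampling draws from.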

Evaluating an explore / exploit policy is difficult because we typically do not know the reward of an action that was not chosen. A possible solution, as we shall see in section 5, is to use a replayer in which previous, randomized exploration data can be used to produce an unbiased offline estimator of the new policy [10]. Unfortunately, that approach cannot be used here because it reduces the effective data size substantially when the number of arms K is large, yielding too high a variance in the evaluation results. [15] studies another promising approach using the idea of importance weighting, but the method applies only when the policy is static, which is not the case for online bandit algorithms that constantly adapt to their history.

For the sake of simplicity, therefore, we considered in this section a simulated environment. More precisely, the context and the ads are real, but the clicks are simulated using a weight vector w*. This weight vector could have been chosen arbitrarily, but it was in fact a perturbed version of some weight vector learned from real clicks. The input feature vectors x are thus as in the real-world setting, but the clicks are artificially generated with probability P(y = 1 | x) = (1 + exp(−w*ᵀx))^{-1}.

About 13,000 contexts, representing a small random subset of the total traffic, are presented every hour to the policy, which has to choose an ad among a set of eligible ads. The number of eligible ads for each context depends on numerous constraints set by the advertiser and the publisher. It varies between 5,910 and 1 with a mean of 1,364 and a median of 514 (over a set of 66,373 ads). Note that in this experiment, the number of eligible ads is smaller than what we would observe in live traffic because we restricted the set of advertisers.

The model is updated every hour as described in Algorithm 3. A feature vector is constructed for every (context, ad) pair and the policy decides which ad to show. A click for that ad is then generated with probability (1 + exp(−w*ᵀx))^{-1}. This labeled training sample is then used at the end of the hour to update the model. The total number of clicks received during this one-hour period is the reward. But in order to eliminate unnecessary variance in the estimation, we instead computed the expectation of that number since the click probabilities are known.

Several explore / exploit strategies are compared; they only differ in the way the ads are selected. All the rest, including the model updates, is identical to what is described in Algorithm 3. These strategies are listed below; a short sketch of the corresponding selection rules follows the list.

Thompson sampling  This is Algorithm 1 where each weight is drawn independently according to its Gaussian posterior approximation N(m_i, q_i^{-1}) (see Algorithm 3). As in section 3, we also consider a variant in which the standard deviations q_i^{-1/2} are first multiplied by a factor α ∈ {0.25, 0.5}. This favors exploitation over exploration.

LinUCB  This is an extension of the UCB algorithm to the parametric case [9]. It selects the ad based on mean and standard deviation. It also has a factor α to control the exploration / exploitation trade-off. More precisely, LinUCB selects the ad for which Σ_{i=1}^{d} m_i x_i + α √(Σ_{i=1}^{d} q_i^{-1} x_i²) is maximum.

Exploit-only  Select the ad with the highest mean.

Random  Select the ad uniformly at random.

ε-greedy  Mix between exploitation and random: with probability ε, select a random ad; otherwise, select the one with the highest mean.
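The following sketch shows these selection rules on top of the posterior (m, q) maintained by Algorithm 3; the function name and exact argument conventions are ours:

```python
import numpy as np

def select_ad(X, m, q, method="thompson", alpha=1.0, epsilon=0.01, rng=None):
    """Return the index of the ad to display among the eligible (context, ad) rows of X."""
    rng = rng or np.random.default_rng()
    means = X @ m
    if method == "thompson":
        # One draw from the diagonal Gaussian posterior, standard deviations scaled by alpha.
        w = rng.normal(m, alpha / np.sqrt(q))
        scores = X @ w
    elif method == "linucb":
        scores = means + alpha * np.sqrt((X ** 2) @ (1.0 / q))
    elif method == "egreedy":
        if rng.random() < epsilon:
            return int(rng.integers(X.shape[0]))
        scores = means
    else:  # exploit-only
        scores = means
    return int(np.argmax(scores))
```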


Table 2: CTR regrets on the display advertising data.

Method        TS                   LinUCB               ε-greedy              Exploit   Random
Parameter     0.25   0.5    1      0.5    1      2      0.005   0.01   0.02
Regret (%)    4.45   3.72   3.81   4.99   4.22   4.14   5.05    4.98   5.22   5.00      31.95

[Figure 4 omitted: CTR regret per hour over the test period for the three algorithms.]

Figure 4: CTR regret over the 4-day test period for 3 algorithms: Thompson sampling with α = 0.5, LinUCB with α = 2, and Exploit-only. The regret in the first hour is large, around 0.3, because the algorithms predict randomly (no initial model provided).


Results  A preliminary result is about the quality of the variance predictions. The diagonal Gaussian approximation of the posterior does not seem to harm the variance predictions. In particular, they are very well calibrated: when constructing a 95% confidence interval for the CTR, the true CTR is in this interval 95.1% of the time.

The regrets of the different explore / exploit strategies can be found in table 2. Thompson sampling achieves the best regret and, interestingly, the modified version with α = 0.5 gives slightly better results than the standard version (α = 1). This confirms the results of the previous section (figure 3), where α < 1 yielded better regrets in the non-asymptotic regime.

Exploit-only does pretty well, at least compared to random selection. This seems at first a bit surprising given that the system has no prior knowledge about the CTRs. A possible explanation is that the change in context induces some exploration, as noted in [13]. Also, the fact that exploit-only is so much better than random might explain why ε-greedy does not beat it: whenever this strategy chooses a random action, it suffers a large regret on average which is not compensated by its exploration benefit.

Finally, figure 4 shows the regret of three algorithms across time. As expected, the regret has a decreasing trend over time.

    5 News Article Recommendation

In this section, we consider another application of Thompson sampling: personalized news article recommendation on the Yahoo! front page [2, 9]. Each time a user visits the portal, a news article out of a small pool of hand-picked candidates is recommended. The candidate pool is dynamic: old articles may retire and new articles may be added in. The average size of the pool is around 20. The goal is to choose the most interesting article for users or, formally, to maximize the total number of clicks on the recommended articles. In this case, we treat articles as arms, and define the payoff to be 1 if the article is clicked on and 0 otherwise. Therefore, the average per-trial payoff of a policy is its overall CTR.


[Figure 5 omitted: normalized CTR versus update delay (10, 30, 60 minutes) for TS 0.5, TS 1, OTS 0.5, OTS 1, UCB 1, UCB 2, UCB 5, EG 0.05, EG 0.1, and Exploit.]

Figure 5: Normalized CTRs of various algorithms on the news article recommendation data with different update delays: {10, 30, 60} minutes. The normalization is with respect to a random baseline.

Each user was associated with a binary raw feature vector of over 1000 dimensions, which encodes information about the user such as age, gender, geographical location, behavioral targeting, etc. These features are typically sparse, so using them directly makes learning more difficult and is computationally expensive. One can find a lower-dimensional feature subspace by, say, following previous practice [9]. Here, we adopted the simpler principal component analysis (PCA), which did not appear to affect the bandit algorithms much in our experience. In particular, we performed a PCA and projected the raw user features onto the first 20 principal components. Finally, a constant feature 1 is appended, so that the final user feature contains 21 components. The constant feature serves as the bias term in the CTR model described next.
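A sketch of this preprocessing step, assuming scikit-learn and a dense 0/1 user matrix (variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def build_user_features(raw_user_matrix, n_components=20):
    """Project raw binary user vectors onto the top principal components
    and append a constant bias feature, yielding 21-dimensional user features."""
    projected = PCA(n_components=n_components).fit_transform(raw_user_matrix)
    bias = np.ones((projected.shape[0], 1))
    return np.hstack([projected, bias])
```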

We use logistic regression, as in Algorithm 3, to model article CTRs: given a user feature vector x ∈ ℝ^{21}, the probability of a click on an article a is (1 + exp(−xᵀw_a))^{-1} for some weight vector w_a ∈ ℝ^{21} to be learned. The same parameter-update algorithm and exploration heuristics are applied as in the previous section. Note that we have a different weight vector for each article, which is affordable as the numbers of articles and features are both small. Furthermore, given the size of the data, we have not found article features to be helpful. Indeed, it is shown in our previous work [9, Figure 5] that article features are helpful in this domain only when data are highly sparse.

Given the small size of the candidate pool, we adopt the unbiased offline evaluation method of [10] to compare the various bandit algorithms. In particular, we collected randomized serving events for a random fraction of user visits; in other words, these random users were recommended an article chosen uniformly from the candidate pool. From 7 days in June 2009, over 34M randomized serving events were obtained.
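A minimal sketch of the replay-style evaluator of [10] on such uniformly-randomized logs; the event fields and policy interface are hypothetical:

```python
def replay_ctr(policy, logged_events):
    """Unbiased offline CTR estimate from uniformly-randomized serving events.

    An event counts only when the evaluated policy picks the same article the
    random logging policy actually showed; the policy is updated on those events.
    """
    clicks, matches = 0, 0
    for event in logged_events:
        chosen = policy.select(event.user_features, event.candidate_pool)
        if chosen == event.shown_article:
            matches += 1
            clicks += event.click
            policy.update(event.user_features, chosen, event.click)
    return clicks / max(matches, 1)
```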

As in section 3, we varied the update delay to study how the various algorithms degrade. Three values were tried: 10, 30, and 60 minutes. Figure 5 summarizes the overall CTRs of four families of algorithms together with the exploit-only baseline. As in the previous section, (optimistic) Thompson sampling appears competitive across all delays. While the deterministic UCB works well with a short delay, its performance drops significantly as the delay increases. In contrast, randomized algorithms are more robust to delay, and when there is a one-hour delay, (optimistic) Thompson sampling is significantly better than the others (given the size of our data).

    6 Conclusion

The extensive experimental evaluation carried out in this paper reveals that Thompson sampling is a very effective heuristic for addressing the exploration / exploitation trade-off. In its simplest form, it does not have any parameter to tune, but our results show that tweaking the posterior to reduce exploration can be beneficial. In any case, Thompson sampling is very easy to implement and should thus be considered as a standard baseline. Also, since it is a randomized algorithm, it is robust in the case of delayed feedback.

Future work includes, of course, a theoretical analysis of its finite-time regret. The benefit of this analysis would be twofold. First, it would hopefully help make Thompson sampling as popular as other algorithms for which regret bounds exist. Second, it could provide guidance on tweaking the posterior in order to achieve a smaller regret.


    References

[1] D. Agarwal, R. Agrawal, R. Khanna, and N. Kota. Estimating rates of rare events with multiple hierarchies through scalable log-linear models. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213-222, 2010.

[2] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems 21, pages 17-24, 2008.

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.

[4] John C. Gittins. Multi-armed Bandit Allocation Indices. Wiley Interscience Series in Systems and Optimization. John Wiley & Sons Inc, 1989.

[5] Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-10), pages 13-20, 2010.

[6] O.-C. Granmo. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics, 3(2):207-234, 2010.

[7] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.

[8] J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(1):273-306, 2005.

[9] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670, 2010.

[10] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297-306, 2011.

[11] Benedict C. May, Nathan Korda, Anthony Lee, and David S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:01, Statistics Group, Department of Mathematics, University of Bristol, 2011. Submitted to the Annals of Applied Probability.

[12] Benedict C. May and David S. Leslie. Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:02, Statistics Group, Department of Mathematics, University of Bristol, 2011.

[13] J. Sarkar. One-armed bandit problems with covariates. The Annals of Statistics, 19(4):1978-2002, 1991.

[14] S. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639-658, 2010.

[15] Alexander L. Strehl, John Langford, Lihong Li, and Sham M. Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23 (NIPS-10), pages 2217-2225, 2011.

[16] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.

[17] K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, and A. Smola. Feature hashing for large scale multitask learning. In ICML, 2009.
