Thompson Sampling for learning in online decision making

Shipra Agrawal
IEOR and Data Science Institute, Columbia University

TRANSCRIPT of slides (sa3305/ThompsonSampling.pdf, Columbia University, 2016-07-21)

[Page 1]
[Page 2]
Movie Recommendations · Online Retail · Content Search
[Page 3]
Goal
• Maximize revenue / customer satisfaction
• Customer "buys", "likes", or "clicks on" at least one of the products (preferably the most expensive one)

Limitations
• Limited display space, customer attention
• Limited prior knowledge of customer preferences

Challenges
1. Learn the "likeability" of products
2. Maximize the revenue or clicks

ARE THE TWO TASKS ALIGNED?

How it works?
• Recommend product(s)
• Observe customer's response
[Page 4]
Recommendations dominated by "strong female lead": stuck at second best, need to explore

EXPLORE AND EXPLOIT
• Explore for more informative data
• Exploit for immediate clicks
[Page 5]
Personalization

RANDOMLY EXPLORE FOR EVERY POSSIBLE TYPE OF CUSTOMER?
[Page 6]
Millions of products
RANDOMLY EXPLORE FOR EVERY POSSIBLE TYPE OF PRODUCT?
[Page 7]
Trends change, cold start: short period for collecting and utilizing data
EXPLORE, BUT ONLY AS MUCH AS REQUIRED
[Page 8]
The multi-armed bandit problem (Thompson 1933; Robbins 1952)
Multiple rigged slot machines in a casino. Which one to put money on?
• Try each one out

WHEN TO STOP TRYING (EXPLORATION) AND START PLAYING (EXPLOITATION)?
[Page 9]
Online decisions: at every time step t = 1, …, T, pull one arm out of N arms

Bandit feedback: only the reward of the pulled arm can be observed

Stochastic feedback: for each arm i, the reward is generated i.i.d. from a fixed but unknown distribution with support [0,1] and mean μ_i
[Page 10]
Maximize expected reward in time T: E[∑_{t=1}^T r_t]

Minimize expected regret in time T:
• The optimal arm is the arm with expected reward μ* = max_i μ_i
• Expected regret for playing arm i: Δ_i = μ* − μ_i
• Expected regret in any time T: R(T) = ∑_i Δ_i E[n_i(T)], where n_i(T) is the number of pulls of arm i

Anytime algorithm: the time horizon T is not known
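As a concrete reading of this regret definition, here is a minimal Python sketch (the function name and the example numbers are illustrative, not from the slides):

```python
import numpy as np

def expected_regret(means, pull_counts):
    """Expected regret R(T) = sum_i Delta_i * E[n_i(T)], where
    Delta_i = mu* - mu_i and mu* = max_i mu_i (hypothetical helper)."""
    means = np.asarray(means, dtype=float)
    counts = np.asarray(pull_counts, dtype=float)
    gaps = means.max() - means          # Delta_i for each arm
    return float(np.dot(gaps, counts))  # sum_i Delta_i * n_i(T)

# Example: mu = (0.9, 0.8, 0.5); pulls of the optimal arm 0 add no regret:
# 0.1 * 8 + 0.4 * 2 = 1.6
r = expected_regret([0.9, 0.8, 0.5], [90, 8, 2])
```

Note that only pulls of suboptimal arms contribute, which is why the analysis below focuses on bounding the number of suboptimal pulls.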
[Page 11]
Natural and efficient heuristic:
• Maintain a belief about the effectiveness (mean reward) of each arm
• Observe feedback, update the belief of the pulled arm i in a Bayesian manner
• Pull each arm with its posterior probability of being the best arm

Does NOT simply choose the arm currently most likely to be effective: it gives the benefit of the doubt to those less explored

"Optimal" benefit of the doubt [Agrawal and Goyal, COLT 2012; AISTATS 2013]
[Page 12]
Bernoulli i.i.d. rewards: playing arm i produces reward 1 with unknown probability μ_i, and 0 otherwise

Maintain Beta posteriors on μ_i:
• Starting prior? Use the very non-informative prior Beta(1,1)
• Beta prior + Bernoulli likelihood → Beta posterior
• Posterior for arm i at time t: Beta(S_i(t)+1, F_i(t)+1), where S_i(t) and F_i(t) count the successes and failures of arm i so far

At any time t, play every arm with its posterior probability of being the best arm
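To make "posterior probability of being the best arm" concrete, here is a minimal Python sketch (function name and counts are illustrative) that estimates these probabilities by Monte Carlo. Thompson Sampling never computes them explicitly: drawing one posterior sample per arm and taking the argmax already selects each arm with exactly this probability.

```python
import numpy as np

def prob_best(successes, failures, n_samples=200_000, seed=0):
    """Monte Carlo estimate of each arm's posterior probability of being
    the best, under independent Beta(S_i+1, F_i+1) posteriors."""
    rng = np.random.default_rng(seed)
    S = np.asarray(successes)
    F = np.asarray(failures)
    theta = rng.beta(S + 1, F + 1, size=(n_samples, len(S)))  # posterior draws
    best = theta.argmax(axis=1)                               # winning arm per draw
    return np.bincount(best, minlength=len(S)) / n_samples

# Arm 0 looks better and is better explored, so it dominates the posterior
# probability of being best; arm 1 still gets an occasional exploratory pull.
p = prob_best([40, 10], [10, 10])
```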
[Page 13]
Start with the uniform prior Beta(1,1) for each arm i
At time t = 1, 2, …
• Posterior for arm i is Beta(S_i(t)+1, F_i(t)+1)
• Sample θ_i(t) from the posterior for each arm i
• Play arm i(t) = argmax_i θ_i(t)
• Observe reward 0 or 1 (1 with probability μ_{i(t)})
• Update the successes and failures of the played arm

A Bayesian algorithm for a frequentist setting!
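The steps above can be sketched in a few lines of Python (a minimal simulation; `true_means` is the simulator's hidden ground truth, never seen by the algorithm):

```python
import numpy as np

def thompson_bernoulli(true_means, T, seed=0):
    """Beta-Bernoulli Thompson Sampling as on the slide: start from
    Beta(1,1), sample theta_i(t), play the argmax, update counts."""
    rng = np.random.default_rng(seed)
    N = len(true_means)
    S = np.zeros(N)  # successes per arm
    F = np.zeros(N)  # failures per arm
    for _ in range(T):
        theta = rng.beta(S + 1, F + 1)           # one posterior sample per arm
        i = int(theta.argmax())                  # play the sampled-best arm
        reward = rng.random() < true_means[i]    # Bernoulli(mu_i) feedback
        if reward:
            S[i] += 1
        else:
            F[i] += 1
    return S, F

S, F = thompson_bernoulli([0.7, 0.5, 0.3], T=5000)
# The best arm should end up with the bulk of the pulls.
```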
[Page 14]
Optimal instance-dependent bounds for Bernoulli rewards:
Regret(T) ≤ (1+ε) ∑_{i: Δ_i > 0} (ln T / d(μ_i, μ*)) Δ_i + O(N/ε²),
where d(μ_i, μ*) denotes the KL divergence between Bernoulli(μ_i) and Bernoulli(μ*)
• Matches the asymptotic lower bound for any algorithm [Lai and Robbins 1985]
• The popular UCB algorithm achieves this only after careful tuning [Bayes-UCB, Kaufmann et al. 2012]

Near-optimal worst-case bounds: Regret(T) = O(√(NT ln T))
• Lower bound: Ω(√(NT))

Only assumption: Bernoulli likelihood
[Page 15]
Suppose the reward for arm i is i.i.d. N(μ_i, 1)
• Starting prior: N(0, 1)
• Gaussian prior + Gaussian likelihood → Gaussian posterior: N(μ̂_i(t), 1/(n_i(t)+1))
• μ̂_i(t) is the empirical mean of the n_i(t) observations for arm i

Algorithm:
• Sample θ_i(t) from the posterior N(μ̂_i(t), 1/(n_i(t)+1)) for each arm i
• Play arm i(t) = argmax_i θ_i(t)
• Observe the reward, update the empirical mean for the played arm
Now apply this algorithm for any reward distribution!
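A minimal Python sketch of the Gaussian version (the reward callback, arm means, and horizon are illustrative; the slide's point is precisely that `sample_reward` need not be Gaussian):

```python
import numpy as np

def thompson_gaussian(sample_reward, N, T, seed=0):
    """Gaussian Thompson Sampling as on the slide: posterior for arm i is
    N(muhat_i, 1/(n_i+1)); sample, play the argmax, update the empirical
    mean. `sample_reward(i)` draws one reward from arm i."""
    rng = np.random.default_rng(seed)
    n = np.zeros(N)      # pull counts
    muhat = np.zeros(N)  # empirical means (prior N(0,1) before any pulls)
    for _ in range(T):
        theta = rng.normal(muhat, 1.0 / np.sqrt(n + 1))  # posterior samples
        i = int(theta.argmax())
        r = sample_reward(i)
        n[i] += 1
        muhat[i] += (r - muhat[i]) / n[i]                # running-mean update
    return n, muhat

# Illustrative use with Gaussian noise around hidden means 1.0, 0.0, -0.5:
rng_env = np.random.default_rng(1)
true_means = [1.0, 0.0, -0.5]
n, muhat = thompson_gaussian(lambda i: true_means[i] + rng_env.normal(), N=3, T=2000)
```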
[Page 16]
Near-optimal instance-dependent bounds: Regret(T) = O(∑_{i: Δ_i > 0} ln T / Δ_i)
• Matches the best available bounds for UCB for general reward distributions

Near-optimal worst-case bounds: Regret(T) = O(√(NT ln N))
• Matches the lower bound within logarithmic factors

Only assumption: bounded or sub-Gaussian noise
[Page 17]
Two arms, μ_1 > μ_2, gap Δ = μ_1 − μ_2
• Every time arm 2 is pulled, Δ regret is incurred
• Bound the number of pulls of arm 2 by O(ln T / Δ²) to get the regret bound
• How many pulls of arm 2 are actually needed?
[Page 18]
After n = O(ln T / Δ²) pulls each of arm 2 and arm 1:
• The empirical means are well separated (error below Δ/2 w.h.p.)
• The Beta posteriors are well separated (standard deviation ≃ 1/√n, small compared to Δ)
• The two arms can be distinguished! No more pulls of arm 2.
[Page 19]
The harder case: O(ln T / Δ²) pulls of arm 2, but few pulls of arm 1, so arm 1's posterior is still wide compared to Δ
[Page 20]
• Arm 1 will be played roughly once every constant number of steps in this situation
• It will take at most a constant number of steps (extra pulls of arm 2) to get out of this situation
• So the total number of pulls of arm 2 is at most O(ln T / Δ²)

Summary: the variance of the posterior enables exploration; optimal bounds (up to optimal constants) require more careful use of the posterior structure
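A quick numeric check of this proof sketch in Python (μ_1, μ_2, T, and the constant C are illustrative choices): after n ≈ C ln T / Δ² pulls of each arm, samples from the two Beta posteriors almost never invert the true ordering of the arms.

```python
import numpy as np

# With mu1 = 0.6, mu2 = 0.5 (Delta = 0.1) and n = C ln(T)/Delta^2 pulls of
# each arm, the Beta posteriors are separated by several posterior standard
# deviations (~ 1/sqrt(n)), so a sampled theta_2 rarely exceeds theta_1.
rng = np.random.default_rng(0)
mu1, mu2, T, C = 0.6, 0.5, 10_000, 4
delta = mu1 - mu2
n = int(C * np.log(T) / delta**2)         # roughly 3700 pulls per arm
s1 = rng.binomial(n, mu1)                 # simulated successes, arm 1
s2 = rng.binomial(n, mu2)                 # simulated successes, arm 2
draws = 100_000
t1 = rng.beta(s1 + 1, n - s1 + 1, draws)  # posterior samples, arm 1
t2 = rng.beta(s2 + 1, n - s2 + 1, draws)  # posterior samples, arm 2
inversion_rate = np.mean(t2 > t1)         # how often arm 2 looks best
```

The inversion rate is essentially zero here, which is exactly why arm 2 stops being pulled once both posteriors have concentrated.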
[Page 21]
Scalability: a large number of products and customer types; utilize similarity?

Content-based recommendation:
• Customers and products described by their features
• Similar features mean similar preferences
• Parametric models mapping customer and product features to customer preferences

Contextual bandits: exploration-exploitation to learn the parametric models
[Page 22]
N arms, possibly very large N
A d-dimensional context (feature vector) x_{i,t} for every arm i and time t

Linear parametric model:
• Unknown parameter θ ∈ R^d
• Expected reward for arm i at time t is x_{i,t} · θ

The algorithm picks x_t ∈ {x_{1,t}, …, x_{N,t}} and observes a reward with mean x_t · θ
The optimal arm depends on the context: x_t* = argmax_{x_{i,t}} x_{i,t} · θ

Goal: minimize regret, Regret(T) = ∑_{t=1}^T (x_t* · θ − x_t · θ)
[Page 23]
Least-squares estimate: solve the t−1 equations x_τ · θ ≈ r_τ, τ = 1, …, t−1:
θ̂_t ≃ B_t^{-1} ∑_{τ<t} x_τ r_τ, where B_t = I + ∑_{τ<t} x_τ x_τ′, and v² B_t^{-1} is the covariance matrix of this estimator

[A., Goyal 2013] Take N(0, v² I) as the starting prior on θ; if the reward distribution given x_t, θ is N(x_t · θ, v²), then the posterior on θ at time t is N(θ̂_t, v² B_t^{-1})
[Page 24]
Algorithm: at step t,
• Sample θ̃_t from N(θ̂_t, v² B_t^{-1})
• Pull the arm with feature x_t such that x_t · θ̃_t = max_i x_{i,t} · θ̃_t

Apply this algorithm for any likelihood, with starting prior N(0, v² I)!
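A minimal Python sketch of this linear Thompson Sampling step, plus a tiny simulation (the hidden θ, the noise level, and the horizon are illustrative choices, not from the slides):

```python
import numpy as np

def linear_ts_step(B, f, contexts, rng, v=1.0):
    """One Thompson Sampling step for the linear contextual bandit, following
    the slide: posterior N(muhat_t, v^2 B_t^{-1}) with B_t = I + sum x x' and
    muhat_t = B_t^{-1} f_t, where f_t = sum x_tau r_tau.
    `contexts` holds one feature vector per arm; returns the arm to pull."""
    Binv = np.linalg.inv(B)
    muhat = Binv @ f                                      # least-squares mean
    theta = rng.multivariate_normal(muhat, v**2 * Binv)   # posterior sample
    return int((contexts @ theta).argmax())               # argmax_i x_{i,t} . theta

rng = np.random.default_rng(0)
d, n_arms, T = 3, 5, 500
theta_true = np.array([1.0, 0.5, -0.5])   # hidden parameter (simulator only)
B, f = np.eye(d), np.zeros(d)
correct_late = 0
for t in range(T):
    X = rng.normal(size=(n_arms, d))            # fresh contexts each round
    i = linear_ts_step(B, f, X, rng)
    r = X[i] @ theta_true + 0.1 * rng.normal()  # linear reward plus noise
    B += np.outer(X[i], X[i])                   # rank-one update of B_t
    f += r * X[i]
    if t >= T - 100:                            # track late-round accuracy
        correct_late += i == int((X @ theta_true).argmax())
```

As the posterior concentrates, the sampled θ̃_t tracks θ and the algorithm picks the truly best context most of the time in the final rounds.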
[Page 25]
With probability 1 − δ, regret Õ(d^{3/2}√T)
• Any likelihood, unknown prior; only assumes bounded or sub-Gaussian noise
• No dependence on the number of arms
• Lower bound: Ω(d√T)
• For UCB, the best bound is Õ(d√T) [Dani et al. 2008; Abbasi-Yadkori et al. 2011]
• Best earlier bound for a polynomial-time algorithm: Õ(d^{3/2}√T) [Dani et al. 2008]
[Page 26]
[Page 27]
Known likelihood: exponential families (with Jeffreys prior) [Korda et al. 2013]

Known prior (Bayesian regret): near-optimal regret bounds for any prior [Russo and Van Roy 2013, 2014; Bubeck and Liu 2013]

Extensions for many variations of MAB: side information, delayed feedback, sleeping bandits, sparse bandits, spectral bandits
[Page 28]
Assortment selection as a multi-armed bandit:
• Arms are products
• Limited display space: k products at a time
• Challenge: the customer's response to one product is influenced by the other products in the assortment, so arms are no longer independent
[Page 29]

[Page 30]

[Page 31]
Multinomial logit (MNL) choice model: the probability of choosing product i (feature vector x_i) in assortment S is
p_i(S) = exp(x_i · θ) / (1 + ∑_{j∈S} exp(x_j · θ))
The log ratio is linear in the features: log(p_i(S) / p_0(S)) = x_i · θ, where p_0(S) is the no-purchase probability

1-dimensional case [A., Avadhanula, Goyal, Zeevi, EC 2016]:
p_i(S) = v_i / (1 + ∑_{j∈S} v_j)
The log ratio is constant: log(p_i(S) / p_0(S)) = log v_i

Independence of irrelevant alternatives
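A minimal Python sketch of the 1-dimensional MNL choice probabilities (the weights `v` are illustrative), which also makes the independence-of-irrelevant-alternatives property concrete: removing product 3 changes every probability, but not the ratio between products 1 and 2.

```python
import numpy as np

def mnl_choice_probs(v, S):
    """MNL choice probabilities for assortment S:
    P(choose i | S) = v_i / (1 + sum_{j in S} v_j),
    with the no-purchase option carrying weight 1."""
    denom = 1.0 + sum(v[j] for j in S)
    return {i: v[i] / denom for i in S}

v = {1: 2.0, 2: 1.0, 3: 0.5}          # illustrative preference weights
pA = mnl_choice_probs(v, [1, 2, 3])
pB = mnl_choice_probs(v, [1, 2])      # drop product 3
# Independence of irrelevant alternatives: p1/p2 is v1/v2 = 2 in both cases.
```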
[Page 32]
N products, unknown parameters v_1, v_2, …, v_N
At every step t:
• Recommend an assortment S_t of size at most K
• Observe the customer's choice c_t and the revenue
• Update the parameter estimates
Goal: optimize the total revenue over T steps, or minimize regret compared to the optimal assortment S* = argmax_S ∑_{i∈S} r_i p_i(S), where r_i is the revenue of product i

[A., Avadhanula, Goyal, Zeevi, EC 2016]
[Page 33]
Challenges:
• Censored feedback: feedback for a product is affected by the other products in the assortment
• A combinatorially large number of possible assortments

Getting an unbiased estimate:
• Offer an assortment repeatedly until a no-purchase occurs
• The number of times product i is purchased in this epoch is an unbiased estimate of its parameter v_i
• Then use standard UCB or Thompson Sampling techniques
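The epoch trick above can be checked numerically in Python (the weights and epoch count are illustrative): each round the customer picks product i with probability v_i / (1 + V) or stops with probability 1 / (1 + V), and the per-epoch purchase count of product i then has mean v_i.

```python
import numpy as np

def run_epoch(v, S, rng):
    """Offer assortment S repeatedly until the customer makes no purchase;
    return per-product purchase counts for the epoch. Under the MNL model
    with no-purchase weight 1, the count for product i has mean v_i."""
    counts = {i: 0 for i in S}
    items = list(S)
    weights = np.array([v[i] for i in items] + [1.0])  # last slot = no purchase
    probs = weights / weights.sum()
    while True:
        c = rng.choice(len(probs), p=probs)
        if c == len(items):           # no purchase: the epoch ends
            return counts
        counts[items[c]] += 1

rng = np.random.default_rng(0)
v = {1: 1.5, 2: 0.5}                  # hidden parameters (simulator only)
totals = {1: 0, 2: 0}
n_epochs = 20_000
for _ in range(n_epochs):
    c = run_epoch(v, [1, 2], rng)
    for i in c:
        totals[i] += c[i]
est = {i: totals[i] / n_epochs for i in totals}   # averages approach v_i
```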
[Page 34]
UCB: Õ(√(NT)) regret for the 1-dimensional parameter case
• Assumes the no-purchase probability is the highest
• Parameter-independent; no dependence on K

An improved regret bound [ongoing work]: the parameter c is a lower bound on the gradient of the choice probability with respect to any product parameter

Thompson Sampling: ongoing work, with significantly more attractive empirical results
[Page 35]
Budget/supply constraints, nonlinear utilities [A. and Devanur, EC 2014] [A. and Devanur, SODA 2015] [A., Devanur, Li, 2016] [A. and Devanur, 2016]

Exploring when your recommendations may not be followed: incentivizing selfish users to explore