Bayesian Bandits
Byron Galbraith, PhD
Cofounder / Chief Data Scientist, Talla
2017.03.24
Bayesian Bandits for the Impatient
1. Online adaptive learning: "Earn while you Learn"
2. Powerful alternative to A/B testing optimization
3. Can be efficient and easy to implement
Dining Ware VR Experiences on Demand
Iterated Decision Problems
What product recommendations should we present to subscribers to keep them engaged?
A/B Testing
Exploit vs Explore - What should we do?
Exploit: choose what seems best so far
• Feel good about our decision
• There still may be something better
Explore: try something new
• Discover a superior approach
• Regret our choice
A/B/n Testing
Regret - What did that experiment cost us?
The Multi-Armed Bandit Problem
http://blog.yhat.com/posts/the-beer-bandit.html
Bandit Solutions
Regret: the cumulative gap between the reward of the optimal action a* and the action actually taken:

ρ = Σ_{t=1}^{T} [ R(s_t(a*)) − R(s_t(a_t)) ]
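The regret sum above can be computed directly when the arms' expected rewards are known, as in a simulation; a minimal sketch (the per-arm reward probabilities and the sequence of chosen arms are invented for illustration):

```python
# Cumulative regret: for each round, add the best arm's expected reward
# minus the expected reward of the arm actually chosen.
mu = [0.05, 0.03, 0.11]   # expected reward per arm (illustrative)
mu_star = max(mu)         # best achievable expected reward

chosen = [0, 2, 1, 2, 2]  # arms picked on rounds t = 1..5 (illustrative)
regret = sum(mu_star - mu[a] for a in chosen)
print(round(regret, 2))   # 0.14: two suboptimal picks cost 0.06 + 0.08
```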
UCB (upper confidence bound) selection for the k-armed MAB:

a_t = argmax_a [ x̄_{a,t} + c √(log t / n_a) ]
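The UCB rule above adds an exploration bonus that grows for rarely tried arms; a sketch with invented counts, showing an under-explored arm winning despite a lower empirical mean:

```python
import math

def ucb_select(means, counts, t, c=2.0):
    """Pick the arm maximizing empirical mean + c * sqrt(log t / n_a)."""
    return max(range(len(means)),
               key=lambda a: means[a] + c * math.sqrt(math.log(t) / counts[a]))

# Arm 1 has the best empirical mean, but arm 2 has only 3 pulls,
# so its exploration bonus dominates. Numbers are illustrative.
means = [0.10, 0.12, 0.08]
counts = [50, 48, 3]
print(ucb_select(means, counts, t=101))   # 2: the least-tried arm
```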
Softmax (preference-based) selection:

P(A_t = a) = e^{H_t(a)} / Σ_{b=1}^{k} e^{H_t(b)} = π_t(a)
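The softmax expression above turns per-arm preferences H into a probability distribution over arms; a minimal sketch (the preference values are invented):

```python
import math

def softmax_probs(H):
    """P(A=a) = exp(H[a]) / sum_b exp(H[b]), computed stably."""
    z = [math.exp(h - max(H)) for h in H]  # shift by max for numerical stability
    s = sum(z)
    return [v / s for v in z]

# Illustrative preferences: the arm with the largest H gets the most mass.
probs = softmax_probs([1.0, 2.0, 0.5])
print([round(p, 3) for p in probs])
```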
Beta prior: P(θ = x) = x^{α−1} (1 − x)^{β−1} / B(α, β)

Binomial likelihood: P(X = x) = (n choose x) θ^x (1 − θ)^{n−x}

Posterior: Beta(α + Σ x_i, β + n − Σ x_i)
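Because the Beta prior is conjugate to the Binomial likelihood, the posterior update above is just counting; a sketch with invented observation counts:

```python
# Conjugate update: start from Beta(alpha, beta), observe n Bernoulli
# trials with k successes; the posterior is Beta(alpha + k, beta + n - k).
alpha, beta = 1, 1   # uniform prior
k, n = 7, 10         # 7 hits out of 10 trials (illustrative)
alpha, beta = alpha + k, beta + (n - k)
posterior_mean = alpha / (alpha + beta)
print(alpha, beta, round(posterior_mean, 3))   # Beta(8, 4), mean 0.667
```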
P(θ | X, n) = P(X | θ, n) P(θ | n) / P(X | n)
Thompson Sampling
P(θ | X, n) ∝ P(X | θ, n) P(θ | n)
Posterior ∝ Likelihood × Prior
Bayesian Bandits – The Model
Model whether a recommendation will result in user engagement
• Bernoulli distribution: θ is the likelihood of the event occurring
How do we find θ?
• Conjugate prior
• Beta distribution: α is the number of hits, β the number of misses
Only need to keep track of two numbers per option
• # of hits, # of misses
Bayesian Bandits – The Algorithm
1. Initialize α = β = 1 for every option (uniform prior)
2. For each user request for recommendations at time t:
   1. Sample θ̂_a ~ Beta(α_a, β_a) for each option a
   2. Choose the action corresponding to the largest θ̂_a
   3. Observe the reward
   4. Update the chosen option's α (hit) or β (miss)
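The loop above is a complete description of Thompson Sampling for Bernoulli rewards; a minimal Python sketch (the simulated click-through rates are invented for the demonstration):

```python
import random

class BernoulliThompson:
    """Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors."""
    def __init__(self, n_arms):
        self.alpha = [1] * n_arms   # 1 + number of hits
        self.beta = [1] * n_arms    # 1 + number of misses

    def select(self):
        # Sample a plausible success rate per arm, pick the best.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        # Conjugate update: a hit bumps alpha, a miss bumps beta.
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# Simulated environment; the true rates below are invented.
random.seed(0)
true_rates = [0.05, 0.15, 0.10]
bandit = BernoulliThompson(len(true_rates))
for _ in range(2000):
    arm = bandit.select()
    reward = 1 if random.random() < true_rates[arm] else 0
    bandit.update(arm, reward)

pulls = [a + b - 2 for a, b in zip(bandit.alpha, bandit.beta)]
print(pulls)   # the best arm should dominate as beliefs sharpen
```

Note that the agent's entire state is the two count lists, matching the "two numbers per option" observation above.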
Belief Adaptation
Bandit Regret
But behavior is dependent on context
• Categorical contexts
  • One bandit model per category
  • One-hot context vector
• Real-valued contexts
  • Can capture interrelatedness of context dimensions
  • More difficult to incorporate effectively
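The "one bandit model per category" idea can be as simple as a dictionary of independent Thompson Sampling agents; a sketch under invented category names and arm counts:

```python
import random

class BernoulliThompson:
    """Minimal Thompson Sampling bandit with Beta(1, 1) priors."""
    def __init__(self, n_arms):
        self.alpha, self.beta = [1] * n_arms, [1] * n_arms

    def select(self):
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# One independent bandit per categorical context (categories are invented).
n_arms = 3
bandits = {cat: BernoulliThompson(n_arms) for cat in ("weekday", "weekend")}

def recommend(category):
    return bandits[category].select()

random.seed(1)
arm = recommend("weekend")
bandits["weekend"].update(arm, reward=1)   # only that category's model learns
```

Each category learns in isolation, which is exactly why this approach cannot share information across related contexts, the weakness the real-valued case tries to address.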
So why would I ever A/B test again?
• Test intent: optimization vs understanding
• Difficulty with non-stationarity: Monday vs Friday behavior
• Deployment: few turnkey options, specialized skill set
https://vwo.com/blog/multi-armed-bandit-algorithm/
Bayesian Bandits for the Patient
1. Thompson Sampling balances exploitation & exploration while minimizing decision regret
2. No need to pre-specify decision splits or a time horizon for experiments
3. Can model a variety of problems and complex interactions
Resources
https://github.com/bgalbraith/bandits