Byron Galbraith, Chief Data Scientist, Talla, at MLconf NYC 2017

Bayesian Bandits
Byron Galbraith, PhD
Cofounder / Chief Data Scientist, Talla
2017.03.24


TRANSCRIPT

Page 1: Byron Galbraith, Chief Data Scientist, Talla, at MLconf NYC 2017

Bayesian Bandits
Byron Galbraith, PhD
Cofounder / Chief Data Scientist, Talla
2017.03.24

Page 2:

Bayesian Bandits for the Impatient

1. Online adaptive learning: “Earn while you learn”

2. Powerful alternative to A/B testing optimization

3. Can be efficient and easy to implement

Page 3:

Dining Ware VR Experiences on Demand

Page 4:

Dining Ware VR Experiences on Demand

Page 5:

Iterated Decision Problems

What product recommendations should we present to subscribers to keep them engaged?

Page 6:

A/B Testing

Page 7:

Exploit vs Explore – What should we do?

Exploit: choose what seems best so far
🙂 Feel good about our decision
🤔 There still may be something better

Explore: try something new
😄 Discover a superior approach
😧 Regret our choice

Page 8:

A/B/n Testing

Page 9:

Regret – What did that experiment cost us?

Page 10:

The Multi-Armed Bandit Problem

http://blog.yhat.com/posts/the-beer-bandit.html

Page 11:

Bandit Solutions

Cumulative regret for the k-armed multi-armed bandit (k-MAB):

$R_T = \sum_{t=1}^{T} \left[ r(Y_t(a^*)) - r(Y_t(a_t)) \right]$

Upper confidence bound (UCB) action selection:

$a_t = \arg\max_i \left[ \bar{r}_{i,t} + c \sqrt{\frac{\log t}{n_i}} \right]$

Softmax action selection from preferences $H$:

$P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a)$

Beta density and Binomial likelihood:

$P(X = x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} \qquad P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$

which give the conjugate posterior for arm $a$ after $N$ pulls with $r_a$ hits:

$\mathrm{Beta}_a(\alpha + r_a,\ \beta + N - r_a)$

Bayes' rule:

$P(X \mid Y, Z) = \frac{P(Y \mid X, Z)\, P(X \mid Z)}{P(Y \mid Z)}$
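The UCB selection rule can be sketched in a few lines of Python. This is a minimal sketch, not code from the talk; the function name, the `c=2.0` default, and the convention of playing any untried arm first are my own choices:

```python
import math

def ucb1_select(counts, rewards, t, c=2.0):
    """Choose arm argmax_i [ mean_reward_i + c * sqrt(log t / n_i) ].

    counts[i]  = number of times arm i has been pulled (n_i)
    rewards[i] = cumulative reward collected from arm i
    t          = current round number
    """
    # An arm never tried yet has an effectively infinite bonus, so play it.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [rewards[i] / counts[i] + c * math.sqrt(math.log(t) / counts[i])
              for i in range(len(counts))]
    return scores.index(max(scores))
```

The exploration bonus shrinks as an arm accumulates pulls, so under-sampled arms keep getting revisited until their uncertainty is resolved.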

Page 12:

Thompson Sampling

$P(\theta \mid r, a) \propto P(r \mid \theta, a)\, P(\theta \mid a)$

Posterior $\propto$ Likelihood $\times$ Prior

Page 13:

Bayesian Bandits – The Model

Model whether a recommendation will result in user engagement
• Bernoulli distribution: θ is the likelihood of the event occurring

How do we find θ?
• Conjugate prior
• Beta distribution: α = number of hits, β = number of misses

Only need to keep track of two numbers per option
• # of hits, # of misses
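Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior update is just count arithmetic. A minimal sketch (the function names `beta_update` and `beta_mean` are illustrative, not from the talk):

```python
def beta_update(alpha, beta, hits, misses):
    """Conjugate update: Beta(a, b) prior + Bernoulli data -> Beta posterior."""
    return alpha + hits, beta + misses

def beta_mean(alpha, beta):
    """Posterior mean estimate of the engagement rate theta."""
    return alpha / (alpha + beta)

# Start from the uniform prior Beta(1, 1), then observe 7 hits and 3 misses.
alpha, beta = beta_update(1, 1, hits=7, misses=3)   # -> Beta(8, 4)
```

No likelihood evaluation or numerical integration is needed, which is why each option only costs two integers of state.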

Page 14:

Bayesian Bandits – The Algorithm

1. Initialize each arm's Beta parameters to α = 1, β = 1 (uniform prior)

2. For each user request for recommendations at time t:
   1. Sample θ̂ₐ ~ Beta(αₐ, βₐ) for every arm a
   2. Choose the action corresponding to the largest sampled θ̂
   3. Observe the reward
   4. Update the chosen arm's posterior (hit: α + 1, miss: β + 1)
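The four steps above map almost line-for-line onto Python. A minimal sketch of one round, assuming Bernoulli rewards and in-place count updates (the function name and the `pull` callback interface are my own):

```python
import random

def thompson_step(alphas, betas, pull):
    """One round of Bernoulli Thompson Sampling over len(alphas) arms.

    alphas[i]/betas[i] hold arm i's Beta posterior counts.
    pull(arm) is the environment: it returns a 0/1 reward.
    """
    # 1. Sample a plausible engagement rate for each arm from its posterior.
    samples = [random.betavariate(a, b) for a, b in zip(alphas, betas)]
    # 2. Act greedily with respect to the sampled rates.
    arm = samples.index(max(samples))
    # 3. Observe the reward (1 = hit, 0 = miss) ...
    reward = pull(arm)
    # 4. ... and update only the chosen arm's counts.
    if reward:
        alphas[arm] += 1
    else:
        betas[arm] += 1
    return arm
```

Repeatedly calling `thompson_step` concentrates pulls on the arm with the highest true rate, while arms with wide posteriors still get sampled occasionally.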

Pages 15–19: Belief Adaptation

Page 20:

Bandit Regret
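The regret definition from the earlier slide can be computed directly once the arms' expected rewards are known. A small illustrative sketch (the function name is mine; in a real experiment the true rates are unknown and regret is estimated in simulation):

```python
def cumulative_regret(best_rate, chosen_rates):
    """R_T = sum over rounds of (best arm's expected reward - chosen arm's).

    best_rate:    expected reward of the optimal arm a*
    chosen_rates: expected reward of the arm actually chosen at each round t
    """
    return sum(best_rate - r for r in chosen_rates)
```

A round where the best arm was chosen contributes zero, so a policy that locks onto the best arm quickly has regret that flattens out over time.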

Page 21:

But behavior is dependent on context

• Categorical contexts
  - One bandit model per category
  - One-hot context vector

• Real-valued contexts
  - Can capture interrelatedness of context dimensions
  - More difficult to incorporate effectively

Page 22:

So why would I ever A/B test again?

• Test intent: optimization vs understanding

• Difficulty with non-stationarity: Monday vs Friday behavior

• Deployment: few turnkey options, specialized skill set

https://vwo.com/blog/multi-armed-bandit-algorithm/

Page 23:

Bayesian Bandits for the Patient

1. Thompson Sampling balances exploitation & exploration while minimizing decision regret

2. No need to pre-specify decision splits or a time horizon for experiments

3. Can model a variety of problems and complex interactions

Page 24:

Resources

https://github.com/bgalbraith/bandits