multi-armed bandit
Intuit Confidential and Proprietary 1
CTG Data Science Lab, August 17, 2016
Multi-armed Bandit Problem: Potential Improvement for DARTS
Aniruddha Bhargava, Yika Yujia Luo
Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
Problem Overview
When do we run into the Multi-armed Bandit Problem (MAB)?
Gambling, Research Funding, Clinical Trials, Content Management
What is the Multi-armed Bandit Problem (MAB)?
Goal: Pick the best restaurant efficiently
Logistics: Select a restaurant for each person, who leaves you a tip afterwards
[Figure: tips received at three restaurants: $1, $8, $10 and $3, $6, $6; Average: $2, Average: $7, Average: $6]
How?
MAB Terminology
Exploration: the process of learning people’s preferences; it always involves a certain degree of randomness
Exploitation: using the current, reliable knowledge of a certain parameter to select a restaurant
Arm: a restaurant
Expected Reward: the average tips in the end
Regret: the expected tip loss from sending a person to a restaurant that is not the best
Policy: the strategy that you use to select a restaurant
Total Cumulative Regret: the total tips you lose; a performance measure for bandit algorithms
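Example: the three restaurants have expected tips of $1, $8 and $10. Sending a person to the $1 restaurant costs $9 of regret (this happens twice), the $8 restaurant costs $2, and the $10 restaurant costs $0.
Total regret: $20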
User: People sent to restaurants
Reward: Tips
[Figure: tips shown in the example: $0, $8, $2, $6]
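The regret arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch, assuming three restaurants with the expected tips shown ($1, $8, $10) and a sequence of four choices that matches the per-person regrets on the slide; the names `expected_tips` and `choices` and the restaurant labels A, B, C are illustrative and not from the deck.

```python
# Expected tip per restaurant (values from the slide's example; labels are made up).
expected_tips = {"A": 1.0, "B": 8.0, "C": 10.0}

# Hypothetical sequence of restaurants chosen for four people,
# picked so that the per-person regrets match the slide: 9, 9, 2, 0.
choices = ["A", "A", "B", "C"]

best = max(expected_tips.values())

# Regret of one decision = best expected tip - expected tip of the chosen arm.
regrets = [best - expected_tips[arm] for arm in choices]
total_regret = sum(regrets)

print(regrets)       # [9.0, 9.0, 2.0, 0.0]
print(total_regret)  # 20.0
```

Regret is computed from expected tips, not from the tips actually received, which is why it can only be measured exactly in a simulation where the expectations are known.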
MAB Big Picture
[Diagram labels: Decision Making, Optimization, MAB]
Choose the best product by finding the best restaurant to go to
Minimize total regret by avoiding sending people to bad restaurants as much as possible
Algorithms (Non-contextual Cases)
“Anytime you are faced with the problem of both exploring and exploiting a search space, you have a bandit problem. Any method of solving that problem is a bandit algorithm”
-- Chris Stucchio
Non-contextual vs. Contextual
Non-Contextual
[Diagram labels: User, Product]
IMPORTANT THING HERE: Although everyone has different tastes, we pick one best restaurant for everyone
MAB Policies
A/B Testing
Adaptive:
• ε-greedy
• Upper Confidence Bound (UCB)
• Thompson Sampling
There are more bandit algorithms …
AB Testing
Exploration: Person i → Random (100%), i.e. each restaurant with probability 33.3%
Exploitation: Person j → the winning restaurant (100%)
ε-greedy
Select (ε = 0.2): for person i,
• 80%: the restaurant with the highest average tips
• 20%: Random (each restaurant with probability 33.3%)
Update: record person i’s feedback and update that restaurant’s average tips value
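The select and update steps of ε-greedy can be sketched as follows. This is an illustrative simulation, not the deck's implementation: the true mean tips, the Gaussian noise and the helper names `epsilon_greedy_select` and `update_average` are all assumptions made for the example.

```python
import random

def epsilon_greedy_select(avg_tips, epsilon, rng):
    """With probability epsilon pick a random arm, else the highest average."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_tips))                       # explore
    return max(range(len(avg_tips)), key=lambda j: avg_tips[j])   # exploit

def update_average(avg_tips, counts, arm, tip):
    """Incrementally update the running average tip of the chosen arm."""
    counts[arm] += 1
    avg_tips[arm] += (tip - avg_tips[arm]) / counts[arm]

# Hypothetical setup: three restaurants with made-up true mean tips.
rng = random.Random(0)
true_means = [2.0, 7.0, 6.0]
avg_tips = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

for _ in range(5000):
    arm = epsilon_greedy_select(avg_tips, epsilon=0.2, rng=rng)
    tip = rng.gauss(true_means[arm], 1.0)   # assumed noisy tip
    update_average(avg_tips, counts, arm, tip)

best_arm = max(range(3), key=lambda j: counts[j])
print(counts, best_arm)
```

Because ε stays fixed at 0.2, about one person in five is still sent to a random restaurant no matter how much has been learned, which is the source of the policy's ongoing regret.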
Upper Confidence Bound (UCB)
Select: person i goes to the restaurant with the highest upper confidence bound (100%)
Update: record person i’s feedback and update the upper confidence bound of that restaurant’s average tips
The bound is computed from: the average tips from restaurant j, #people who went to restaurant j, and #people
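The transcript keeps only the ingredients of the bound, not the formula itself. The sketch below assumes the standard UCB1 index, average reward plus sqrt(2 ln(n) / n_j), which uses exactly those ingredients; whether the slide showed this exact form is an assumption. Rewards are assumed to be scaled to [0, 1] (for example tip / 10), and the success probabilities are made up.

```python
import math
import random

def ucb1_select(avg_rewards, counts):
    """Return the arm with the highest UCB1 index; unplayed arms go first."""
    for j, n_j in enumerate(counts):
        if n_j == 0:
            return j
    total = sum(counts)

    def index(j):
        # average reward + exploration bonus that shrinks as arm j is played more
        return avg_rewards[j] + math.sqrt(2.0 * math.log(total) / counts[j])

    return max(range(len(counts)), key=index)

# Hypothetical simulation with rewards in {0, 1}.
rng = random.Random(0)
true_p = [0.1, 0.5, 0.9]      # made-up success probabilities
avg_rewards = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

for _ in range(3000):
    arm = ucb1_select(avg_rewards, counts)
    reward = 1.0 if rng.random() < true_p[arm] else 0.0
    counts[arm] += 1
    avg_rewards[arm] += (reward - avg_rewards[arm]) / counts[arm]

print(counts)
```

Unlike ε-greedy, the selection step has no randomness: exploration comes entirely from the bonus term, which is large for restaurants that few people have visited.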
Thompson Sampling (Bayesian)
Sampling: simulate the 3 restaurants’ average tip distributions (McDonald’s, Subway, Chili's) and randomly draw a value from each distribution
Select: person i goes to the restaurant with the highest tips from the sampling (100%)
Update: record person i’s feedback and update that restaurant’s average tip distribution
[Figure: distributions of Average Tips ($) for McDonald’s, Subway and Chili's]
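The sample, select and update loop can be sketched with a Beta-Bernoulli model. This is a simplification of the slide, which works with average tip distributions: here each visit is assumed to yield a binary outcome (a "good tip" or not), each arm starts from a uniform Beta(1, 1) prior, and the probabilities in `true_p` are made up.

```python
import random

def thompson_select(successes, failures, rng):
    """Draw one sample per arm from its Beta posterior; pick the largest draw."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda j: draws[j])

rng = random.Random(1)
true_p = [0.3, 0.5, 0.8]   # hypothetical chance of a good tip at arms 0, 1, 2
successes = [0, 0, 0]
failures = [0, 0, 0]

for _ in range(3000):
    arm = thompson_select(successes, failures, rng)
    # Update the chosen arm's posterior with the observed outcome.
    if rng.random() < true_p[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)
```

As an arm's posterior narrows, its draws rarely beat a clearly better arm, so exploration fades on its own without a tuning parameter such as ε.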
Thompson Sampling (Bayesian)
[Figure: two panels, labeled Pr(r < b) = 10% and Pr(r < b) = 0.01%]
Algorithm Comparison
1. Exploration vs. Exploitation
2. Total Regret
3. Batch Update
Algorithm Comparison: Exploration vs. Exploitation
IMPORTANT THING HERE: Exploration costs money!
[Figure: Exploration (%) versus Time (%), both axes from 0 to 100; curves: AB Testing, ε-greedy, UCB/Thompson]
Algorithm Comparison: Total Regret
[Figure: two panels over Time, AB Testing and Adaptive, with allocation labels M 44%, S 28%, C 28% and M 70%, S 18%, C 12%]
Algorithm Comparison: Batch Update
AB Testing: Very Robust
ε-greedy: Depends
UCB: Not Robust
Thompson: Robust
[Diagram labels: System, User, Question, Answer, Store Many Answers]
Algorithm Comparison: Summary
Pros
• AB Testing: easy to implement; good for a small number of arms; robust to batch update
• ε-greedy: easy to implement; if a good ε is found, lower total regret and faster to find the best arm than ε-first
• UCB / Thompson: good for a large number of arms; find the best arm fast; low total regret; robust to batch update (Thompson)
Cons
• AB Testing: high total regret
• ε-greedy: need to figure out a good ε; high total regret
• UCB / Thompson: not robust to batch update (UCB); sensitive to statistical assumptions
Non-contextual vs. Contextual
Contextual
User: Female, Vegetarian, Married, Latino
Product: Burger, Non-Vegetarian, Cheap, Good Service
IMPORTANT THING HERE: Everyone has different tastes, so we pick one best restaurant for each person
Algorithms (Contextual Bandits)
What do we mean by context?
User side:
Likes spicy food, refined tastes, plays violin, Male, …
From Wisconsin, likes German food, likes Football, Male, …
Student, doesn’t like seafood, allergic to cats, Female, …
Chief of AFC, watches shows on competitive eating, Female, …
Arm side:
Tex-Mex style, sit-down dining, founded in 1975, …
Serves sandwiches, has veggie options, founded in 1965, …
Breakfast, lunch, and dinner, cheap, founded in 1940, …
User Context
[Figure: Average reward over time (x-axis 0 to 25, y-axis 0 to 16); curves: Non-contextual, Best possible without context, Context (user), Best possible with context; annotations: Non-Contextual, User Context]
Arm Context
[Figure: Average reward over time (x-axis 0 to 25, y-axis 0 to 16); curves: Non-contextual, Contextual (arm), Contextual (user), Best possible without user context, Best possible with user context, Context (arm and user); annotations: Non-contextual, Only arm context, Both arm and user context]
Takeaway Message
User context can increase the optimal rewards; arm context can get you there faster!
Exploiting Context
User side:
• Population segmentation, e.g. DARTS
• Clustering users
• Learning embedding
Arms side:
• Linear models: LinUCB, Linear TS, OFUL
• Maintain estimate of best arm
• More data → shrink uncertainty
Exploiting User Context
Assumptions:
• Users can be represented as points in space
• Users cluster together so that points that are close are similar
• Stationarity
Exploiting User Context
[Figure: users plotted on two axes, meat to vegetarian and spicy to mild: Joe, Yao, Nichola, Peter, Aniruddha, Rachel, Sophie, Yika, Vineeta, Jason, Andre, Chris, Madeline, John]
Exploiting User Context
[Figure: users plotted on meat/vegetarian and spicy/mild axes; panel label: Linear]
Exploiting User Context
[Figure: users plotted on meat/vegetarian and spicy/mild axes; panel label: Quadratic]
Exploiting User Context
[Figure: users plotted on meat/vegetarian and spicy/mild axes; panel label: Hierarchical; percentages shown: 40% 35% 25%]
Exploiting User Context
[Figure: users plotted on meat/vegetarian and spicy/mild axes; panel label: Hierarchical; percentages shown per group: 80% 15% 5%; 5% 15% 80%]
Exploiting User Context
[Figure: users plotted on meat/vegetarian and spicy/mild axes; panel label: Hierarchical; percentages shown per group: 5% 50% 45%; 80% 15% 5%; 5% 10% 85%; 15% 80% 5%]
Exploiting User Context
[Figure: users plotted on meat/vegetarian and spicy/mild axes; panel label: Hierarchical; percentages shown per group: 80% 15% 5%; 5% 5% 90%; 10% 45% 45%; 5% 50% 45%; 15% 80% 5%]
Exploiting Arm Context
Linear models
Look at only arm context and no user context
Assumptions:
• We can represent arms as vectors.
• Rewards are a noisy version of the inner product.
• Stationarity.
Methods include:
• Linear UCB
• Linear Thompson Sampling
• OFUL (Optimism in the Face of Uncertainty – Linear)
• ... and many more.
The Math Slide
Standard noisy linear model: $r_t = x_t^T \theta^* + \eta_t$
$\theta^*$ : the optimal arm
$x_t$ : arm pulled at time t
$r_t$ : reward at time t
$\eta_t$ : noise at time t
$C_t$ : confidence set
$\lambda$ : ridge term
$X_t$ : matrix of all arms pulled till time t
Collect all data and write: $r = X \theta^* + \eta$
Least Squares Solution: $\theta_{LS} = (X^T X)^{-1} X^T r$
Ridge regression: $\theta_{LSR} = (X^T X + \lambda I)^{-1} X^T r$
Typical Linear Bandit algorithm:
$\theta_0 = 0$
$t = 0, 1, 2, \dots$
$x_t = \arg\max_{x \in C_t} (x^T \theta_t)$
$\theta_t = (X_t^T X_t + \lambda I)^{-1} X_t^T r_t$
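The loop on this slide can be sketched in plain Python for two-dimensional arms. The sketch makes several assumptions that are not in the deck: arms are unit vectors, the selection step uses a LinUCB-style optimistic score (estimated reward plus `alpha` times an uncertainty term) in place of the confidence set, the noise is Gaussian, and all names and parameter values are illustrative.

```python
import math
import random

def inv2(m):
    """Inverse of a 2x2 matrix [[p, q], [r, s]]."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def run_linear_bandit(arms, theta_star, rounds, lam=1.0, alpha=1.0, noise=0.1, seed=0):
    rng = random.Random(seed)
    A = [[lam, 0.0], [0.0, lam]]   # X^T X + lambda I
    b = [0.0, 0.0]                 # X^T r
    theta = [0.0, 0.0]             # theta_0 = 0
    for _ in range(rounds):
        A_inv = inv2(A)

        def score(x):
            # estimated reward + exploration bonus (assumed LinUCB-style rule)
            return dot(x, theta) + alpha * math.sqrt(dot(x, matvec(A_inv, x)))

        x = max(arms, key=score)
        r = dot(x, theta_star) + rng.gauss(0.0, noise)   # r_t = x_t^T theta* + eta_t
        for i in range(2):
            b[i] += r * x[i]
            for j in range(2):
                A[i][j] += x[i] * x[j]
        theta = matvec(inv2(A), b)   # ridge estimate (X^T X + lambda I)^-1 X^T r
    return theta

# Hypothetical arms: unit vectors every 30 degrees; the arm at 60 degrees is optimal.
arms = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in range(0, 180, 30)]
theta_star = arms[2]
theta_hat = run_linear_bandit(arms, theta_star, rounds=500)
best = max(arms, key=lambda x: dot(x, theta_hat))
print(theta_hat, best)
```

Every pull updates the shared estimate of θ, so a reward observed for one arm also narrows the uncertainty about arms pointing in similar directions.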
Exploiting Arm Context
[Figure: set of arms x1, x2, … plotted on two axes, meat to vegetarian and spicy to mild: Mince pie, Buffalo wings, Tofu scramble, Grilled vegetables, Ratatouille, Tandoori Chicken, Jalapeno scramble, Pad Thai, Penne Arrabiata; legend: Arms, Optimal arm]
θ* : the optimal arm
Exploiting Arm Context
[Figure: legend: Arms, Optimal arm, Next arm chosen; labels: Buffalo wings, θ]
Reward (= cos(θ)) is small, but we can still infer information about other arms!
Exploiting Arm Context
[Figure: legend: Arms, Optimal arm, Next arm chosen, Estimate of optimal arm, Region of uncertainty; labels: C1, θ1]
Exploiting Arm Context
We’ve already honed in on a pretty good choice
[Figure: legend: Arms, Optimal arm, Next arm chosen, Estimate of optimal arm, Region of uncertainty; label: x2]
Exploiting Arm Context
And the process continues …
[Figure: legend: Arms, Optimal arm, Next arm chosen, Estimate of optimal arm, Region of uncertainty; labels: C2, θ2]
Some Caveats
• Big assumption that we know good features.
• Finding features takes a lot of work.
• Few arms, many people → learn an embedding of arms
• Few people, many arms → Featurize, linear bandits
• Linear models are a naive assumption; see kernel methods.
Industry Review
Companies using MAB
Headlines, Photos and Ads
Washington Post, Google
Washington Post
Used Upper Confidence Bound (UCB) to pick headlines and photos
Google Experiments
Used Thompson Sampling (TS)
Updated models twice a day
Two metrics used to gauge end of experiment:
• 95% confidence that alternate better or …
• "potential value remaining in the experiment"
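One way to read the first stopping metric is as the posterior probability that an arm is the best one. The sketch below estimates that probability by Monte Carlo sampling from Beta posteriors; it is not Google's implementation, and the conversion counts, the uniform Beta(1, 1) prior and the function name `prob_best` are assumptions made for illustration.

```python
import random

def prob_best(successes, failures, draws=20000, seed=0):
    """Estimate, for each arm, the posterior probability that it is the best."""
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]

# Hypothetical counts: original converts 40 of 1000, alternate converts 70 of 1000.
p = prob_best(successes=[40, 70], failures=[960, 930])

# Stop the experiment once some arm is best with at least 95% probability.
stop = max(p) >= 0.95
print(p, stop)
```

Since the deck says models were updated twice a day, a check like this would naturally run after each batch update rather than after every visitor.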
Takeaway Message
The more arms, the higher the gain over A/B testing.
Advanced Topics
Topics
Biasing
Data Joining and Latency
Non-stationary
Bias
Website 1: Probability 50% / 50%, Number sold 100 / 20
Website 2: Probability 90% / 10%, Number sold 100 / 20
Who did better?
Bias
• Be careful when using past data!
• Inverse Propensity Score Matching
• New sales estimates:
Website 1: 100*0.5 + 20*0.5 = 60
Website 2: 100*0.5*(0.5/0.9) + 20*0.5*(0.5/0.1) ≈ 77.8
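The two estimates can be checked by evaluating the slide's expressions directly. The sketch mirrors the slide's arithmetic term by term rather than implementing a general inverse propensity estimator; the function name `reweighted_sales` and its parameter names are illustrative. Evaluated as written, the Website 2 expression comes to about 77.8.

```python
def reweighted_sales(sold, logged_probs, target_prob=0.5):
    """Sum of sold_k * target * (target / logged_k), as in the slide's expressions."""
    return sum(n * target_prob * (target_prob / p) for n, p in zip(sold, logged_probs))

sold = [100, 20]

# Website 1 showed the two options with probability 50% / 50%.
website1 = reweighted_sales(sold, logged_probs=[0.5, 0.5])

# Website 2 showed them with probability 90% / 10%.
website2 = reweighted_sales(sold, logged_probs=[0.9, 0.1])

print(website1)            # 60.0
print(round(website2, 1))  # 77.8
```

The option that was shown only 10% of the time is weighted up by a factor of 5, which is why the raw sales numbers alone understate how well it would do under an even split.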
Data Joining and Latency
Courtesy: Microsoft MWT white paper
[Figure labels: Context, decision; Rewards; Latency]
Non-Stationarity – Beer example
My yearly beer taste:
January: Stouts and porters
April: Pale Ales and IPAs
July: Wits and Lagers
October: Oktoberfests and Reds
December: Christmas Ales
Non-Stationarity
Preferences change over time.
There may be periodicity in the data; tax season is a great example.
Some solutions:
• Slow changes → System with finite memory
• Abrupt changes → Subspace tracking/anomaly detection
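A "system with finite memory" can be as simple as estimating each arm's reward from its most recent observations only. The sketch below is one possible reading, assuming a fixed-size sliding window; the class name `WindowedMean`, the window size and the reward sequence are made up for illustration.

```python
from collections import deque

class WindowedMean:
    """Mean reward over the last `window` observations only (finite memory)."""

    def __init__(self, window):
        self.values = deque(maxlen=window)   # old rewards fall off automatically

    def update(self, reward):
        self.values.append(reward)

    def mean(self):
        return sum(self.values) / len(self.values) if self.values else 0.0

# Hypothetical reward stream whose level shifts from 1 to 9 halfway through.
rewards = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]

recent = WindowedMean(window=3)
for r in rewards:
    recent.update(r)

full_history_mean = sum(rewards) / len(rewards)

print(recent.mean())       # 9.0
print(full_history_mean)   # 5.0
```

The windowed estimate tracks the new level while the full-history average is still pulled toward the old one; the price is noisier estimates, since fewer observations are used.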
Takeaway Message
Preferences change over time, biases are added, and data needs to be joined from different sources.
Thank You. Questions?