multi-armed bandit

Intuit Confidential and Proprietary 1

CTG Data Science Lab, August 17, 2016

Multi-armed Bandit Problem: Potential Improvement for DARTS

Aniruddha Bhargava, Yika Yujia Luo


Agenda

1. Problem Overview

2. Algorithms

Non-contextual cases

Contextual cases

3. Industry Review

4. Advanced Topics


Problem Overview


When do we run into the Multi-armed Bandit Problem (MAB)?

Gambling

Research Funding

Clinical Trials

Content Management


What is the Multi-armed Bandit Problem (MAB)?

Goal: Pick the best restaurant efficiently

Logistics: Select a restaurant for each person, who leaves you a tip afterwards

$1 $8 $10

How?

$3 $6 $6

Average: $2 Average: $7 Average: $6


MAB Terminology

Exploration: the process of learning people’s preferences; it always involves a certain degree of randomness

Exploitation: use the current, reliable knowledge of a certain parameter to select a restaurant

Arm: restaurant

Expected Reward: Average tips in the end

Regret: expected tip loss after sending a person to a restaurant that is not the best

Policy: a strategy that you use to select a restaurant

Total Cumulative Regret: the total tips you lose -- a performance measure for bandit algorithms

User: people sent to restaurants

Reward: tips ($0, $8, $2, $6)

Example: the three restaurants have expected tips of $1, $10, and $8, so the best expected tip is $10.

Regret per person: $9, $9, $2, $0

Total regret: $20
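The regret arithmetic above can be checked with a short sketch. The function name `total_regret` and the assignment order in `choices` are illustrative assumptions; the slide gives only the expected tips and the per-person regrets.

```python
def total_regret(expected_tips, choices):
    """Sum of (best expected tip - expected tip of the chosen restaurant)."""
    best = max(expected_tips)
    return sum(best - expected_tips[arm] for arm in choices)


# Expected tips from the slide: $1, $10, and $8.
expected_tips = [1, 10, 8]

# Hypothetical assignment order that reproduces regrets of $9, $9, $2, $0.
choices = [0, 0, 2, 1]

print(total_regret(expected_tips, choices))  # 20
```

Regret is computed from expected tips, not from the tips actually received, so it measures the quality of the policy rather than the luck of one visit.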


MAB Big Picture

MAB combines decision making and optimization:

Decision making: choose the best product by finding the best restaurant to go to

Optimization: minimize total regret by avoiding sending people to bad restaurants as much as possible


Algorithms (Non-contextual Cases)

“Anytime you are faced with the problem of both exploring and exploiting a search space, you have a bandit problem. Any method of solving that problem is a bandit algorithm”

-- Chris Stucchio


Non-contextual vs. Contextual: Non-Contextual

[Figure labels: User, Product]

IMPORTANT THING HERE: Although everyone has different tastes, we pick one best restaurant for everyone


MAB Policies

A/B Testing

Adaptive:

• ε-greedy

• Upper Confidence Bound (UCB)

• Thompson Sampling

There are more bandit algorithms …


AB Testing

Exploration: person i is sent to a random restaurant 100% of the time (33.3% each)

Exploitation: person j is sent 100% of the time to the restaurant chosen after exploration
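A minimal sketch of the two A/B testing phases, assuming a fixed exploration budget followed by a commitment to the restaurant with the best observed average. The names `ab_test`, `pull`, `n_explore`, and the simulated tip values are hypothetical, not part of the original deck.

```python
import random


def ab_test(pull, n_arms, n_explore, n_total, rng):
    """Explore uniformly for n_explore rounds, then commit to the best average."""
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    choices = []
    for _ in range(n_explore):
        arm = rng.randrange(n_arms)  # exploration: pure random assignment
        totals[arm] += pull(arm)
        counts[arm] += 1
        choices.append(arm)
    means = [totals[a] / counts[a] if counts[a] else 0.0 for a in range(n_arms)]
    best = max(range(n_arms), key=lambda a: means[a])
    for _ in range(n_total - n_explore):  # exploitation: always the winner
        pull(best)
        choices.append(best)
    return best, choices


rng = random.Random(0)
true_means = [2.0, 7.0, 6.0]  # hypothetical average tips per restaurant
best, choices = ab_test(lambda arm: rng.gauss(true_means[arm], 1.0), 3, 300, 1000, rng)
```

Every exploration round costs the same expected regret regardless of what has been learned so far, which is the weakness the adaptive policies try to address.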


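ε-greedy

Select (ε = 0.2): with probability 80%, send person i to the restaurant with the highest average tips; with probability 20%, send person i to a random restaurant (33.3% each)

Update: record person i’s feedback and update that restaurant’s average tips value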
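The select and update steps can be sketched as follows. The class name `EpsilonGreedy` and its attributes are illustrative assumptions; the running-average update is the standard incremental mean.

```python
import random


class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.2, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms   # people sent to each restaurant
        self.means = [0.0] * n_arms  # running average tip per restaurant
        self.rng = random.Random(seed)

    def select(self):
        # With probability epsilon, explore: any restaurant, uniformly at random.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.means))
        # Otherwise exploit: the restaurant with the highest average tips.
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def update(self, arm, reward):
        # Incremental mean: no need to store every tip.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


policy = EpsilonGreedy(n_arms=3, epsilon=0.2)
arm = policy.select()
policy.update(arm, reward=6.0)
```

Because ε stays fixed, the policy keeps exploring at the same rate forever, which is why a poorly chosen ε leads to high total regret.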


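Upper Confidence Bound (UCB)

Select: send person i (100%) to the restaurant with the highest upper confidence bound

Update: record person i’s feedback and update the upper confidence bound of that restaurant’s average tips

The bound for restaurant j depends on the average tips from restaurant j, the number of people who went to restaurant j, and the total number of people.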
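The slide's formula did not survive extraction, so the sketch below assumes the common UCB1 bound, mean_j + sqrt(2 ln(n) / n_j), which uses exactly the three quantities named above. UCB1 also assumes rewards scaled to [0, 1]; the class name `UCB1` is illustrative.

```python
import math


class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # number of people who went to restaurant j
        self.means = [0.0] * n_arms  # average (scaled) tips from restaurant j

    def select(self):
        # Try every restaurant once before trusting any bound.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)     # total number of people so far
        bounds = [m + math.sqrt(2.0 * math.log(total) / n)
                  for m, n in zip(self.means, self.counts)]
        return max(range(len(bounds)), key=lambda a: bounds[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

The bonus term shrinks as a restaurant is visited more often, so rarely visited restaurants keep getting a chance without any explicit randomness.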


Thompson Sampling (Bayesian)

Sampling: simulate the 3 restaurants’ average tip distributions (McDonald’s, Subway, Chili's) and randomly draw a value from each distribution

Select: send person i (100%) to the restaurant with the highest tips from the sampling

Update: record person i’s feedback and update that restaurant’s average tip distribution
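A minimal sketch of the sample, select, and update loop. The deck does not say which distributions were used, so this assumes Gaussian tips with a known noise level and a Gaussian prior on each restaurant's average tip; `GaussianThompson`, `prior_sd`, and `noise_sd` are hypothetical names.

```python
import random


class GaussianThompson:
    def __init__(self, n_arms, prior_mean=0.0, prior_sd=10.0, noise_sd=1.0, seed=0):
        self.prior_mean = prior_mean
        self.prior_prec = 1.0 / prior_sd ** 2   # precision = 1 / variance
        self.noise_prec = 1.0 / noise_sd ** 2
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.rng = random.Random(seed)

    def posterior(self, arm):
        """Posterior mean and standard deviation of the arm's average tip."""
        prec = self.prior_prec + self.counts[arm] * self.noise_prec
        mean = (self.prior_mean * self.prior_prec
                + self.sums[arm] * self.noise_prec) / prec
        return mean, prec ** -0.5

    def select(self):
        # Draw one plausible average tip per restaurant, then pick the highest draw.
        draws = [self.rng.gauss(*self.posterior(a)) for a in range(len(self.counts))]
        return max(range(len(draws)), key=lambda a: draws[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
```

A restaurant with few visits has a wide posterior, so its draws occasionally come out on top; as data accumulates the posterior narrows and exploration fades on its own.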


Thompson Sampling (Bayesian)

Pr(r < b) = 10% Pr(r < b) = 0.01%


Algorithm Comparison

1. Exploration vs. Exploitation

2. Total Regret

3. Batch Update


Algorithm Comparison: Exploration vs. Exploitation

IMPORTANT THING HERE: Exploration costs money!

[Figure: exploration (%) versus time (%), both axes from 0 to 100, with curves for AB Testing, ε-greedy, and UCB/Thompson]


Algorithm Comparison: Total Regret

[Figure: two panels plotted against time. AB Testing: M 44%, S 28%, C 28%. Adaptive: M 70%, S 18%, C 12%.]


Algorithm Comparison: Batch Update

AB Testing: Very Robust

ε-greedy: Depends

UCB: Not Robust

Thompson: Robust

[Figure labels: System, User, Question, Answer, Store, Many Answers]


Algorithm Comparison: Summary

AB Testing

Pros:
• Easy to implement
• Good for a small number of arms
• Robust to batch update

Cons:
• High total regret

ε-greedy

Pros:
• Easy to implement
• If a good ε is found, lower total regret and faster to find the best arm than ε-first

Cons:
• Need to figure out a good ε
• High total regret

UCB and Thompson

Pros:
• Good for a large number of arms
• Find the best arm fast
• Low total regret
• Robust to batch update (Thompson)

Cons:
• Not robust to batch update (UCB)
• Sensitive to statistical assumptions


Non-contextual vs. Contextual: Contextual

User: Female, Vegetarian, Married, Latino

Product: Burger, Non-Vegetarian, Cheap, Good Service

IMPORTANT THING HERE: Everyone has different tastes, so we pick one best restaurant for each person



Algorithms (Contextual Bandits)


What do we mean by context?

User side:

• Likes spicy food, refined tastes, plays violin, Male, …

• From Wisconsin, likes German food, likes Football, Male, …

• Student, doesn’t like seafood, allergic to cats, Female, …

• Chief of AFC, watches shows on competitive eating, Female, …

Arm side:

• Tex-Mex style, sit down dining, founded in 1975, …

• Serves sandwiches, has veggie options, founded in 1965, …

• Breakfast, lunch, and dinner, cheap, founded in 1940, …


User Context

[Figure: average reward over time (x-axis: 0 to 25; y-axis: 0 to 16). Panels: Non-Contextual, User Context. Curves: non-contextual, best possible without context, context (user), best possible with context.]


Arm Context

[Figure: average reward over time (x-axis: 0 to 25; y-axis: 0 to 16). Panels: non-contextual, only arm context, both arm and user context. Curves: non-contextual, contextual (arm), contextual (user), context (arm and user), best possible without user context, best possible with user context.]

Takeaway Message: User context can increase the optimal rewards; arm context can get you there faster!


Exploiting Context

User side:

• Population segmentation, e.g. DARTS

• Clustering users

• Learning embedding

Arms side:

• Linear models: LinUCB, Linear TS, OFUL

• Maintain estimate of best arm

• More data → shrink uncertainty


Exploiting User Context

Assumptions:

• Users can be represented as points in space

• Users cluster together so that points that are close are similar

• Stationarity



[Figure: users plotted on two axes, meat to vegetarian and spicy to mild: Joe, Yao, Nichola, Peter, Aniruddha, Rachel, Sophie, Yika, Vineeta, Jason, Andre, Chris, Madeline, John]


Linear

[Figure: users plotted on two axes, meat to vegetarian and spicy to mild; panel title: Linear]



Quadratic

[Figure: users plotted on two axes, meat to vegetarian and spicy to mild; panel title: Quadratic]



Hierarchical

[Figure: users plotted on two axes, meat to vegetarian and spicy to mild. Percentages shown on the plot: 40% 35% 25%]



Hierarchical

[Figure: users plotted on two axes, meat to vegetarian and spicy to mild. Percentages shown on the plot: 80% 15% 5%; 5% 15% 80%]



Hierarchical

[Figure: users plotted on two axes, meat to vegetarian and spicy to mild. Percentages shown on the plot: 5% 50% 45%; 80% 15% 5%; 5% 10% 85%; 15% 80% 5%]


Hierarchical

[Figure: users plotted on two axes, meat to vegetarian and spicy to mild. Percentages shown on the plot: 80% 15% 5%; 5% 5% 90%; 10% 45% 45%; 5% 50% 45%; 15% 80% 5%]


Exploiting Arm Context: Linear models

Look at only arm context and no user context

Assumptions:

• We can represent arms as vectors.

• Rewards are a noisy version of the inner product.

• Stationarity.

Methods include:

• Linear UCB

• Linear Thompson Sampling

• OFUL (Optimism in the Face of Uncertainty – Linear)

• ... and many more.


The Math Slide

Standard noisy linear model: $r_t = x_t^T \theta^* + \eta_t$

$\theta^*$ : the optimal arm

$x_t$ : arm pulled at time t

$r_t$ : reward at time t

$\eta_t$ : noise at time t

$C_t$ : confidence set

$\lambda$ : ridge term

$X_t$ : matrix of all arms pulled till time t

Collect all data and write: $r = X \theta^* + \eta$

Least Squares Solution: $\theta_{LS} = (X^T X)^{-1} X^T r$

Ridge regression: $\theta_{LSR} = (X^T X + \lambda I)^{-1} X^T r$

Typical Linear Bandit algorithm:

$\theta_0 = 0$

For $t = 0, 1, 2, \dots$:

$x_t = \arg\max_{x \in C_t} (x^T \theta_t)$

$\theta_t = (X_t^T X_t + \lambda I)^{-1} X_t^T r_t$
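The loop above can be sketched in two dimensions with the standard library only. This is a LinUCB-style variant, not the exact algorithm on the slide: instead of maximizing over a confidence set C_t, it adds an exploration bonus alpha * sqrt(x^T A^{-1} x) to the ridge estimate, where A = X^T X + λI. The names `lin_ucb_2d`, `solve_2x2`, `alpha`, and the eight arms on the unit circle are assumptions made for illustration.

```python
import math
import random


def solve_2x2(a, b):
    """Solve a @ x = b for a 2x2 matrix a given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def lin_ucb_2d(arms, theta_star, n_rounds=500, lam=1.0, alpha=1.0,
               noise_sd=0.1, seed=0):
    rng = random.Random(seed)
    a = [[lam, 0.0], [0.0, lam]]  # X^T X + lambda * I
    b = [0.0, 0.0]                # X^T r
    theta = [0.0, 0.0]            # theta_0 = 0
    pulls = []
    for _ in range(n_rounds):
        def score(x):
            ainv_x = solve_2x2(a, x)  # A^{-1} x
            width = math.sqrt(x[0] * ainv_x[0] + x[1] * ainv_x[1])
            return x[0] * theta[0] + x[1] * theta[1] + alpha * width

        k = max(range(len(arms)), key=lambda i: score(arms[i]))
        x = arms[k]
        # Noisy linear reward: r_t = x_t^T theta* + eta_t
        r = x[0] * theta_star[0] + x[1] * theta_star[1] + rng.gauss(0.0, noise_sd)
        for i in range(2):
            b[i] += r * x[i]
            for j in range(2):
                a[i][j] += x[i] * x[j]
        theta = solve_2x2(a, b)       # ridge regression estimate
        pulls.append(k)
    return theta, pulls


# Hypothetical setup: eight unit-length arms, with theta* equal to arm 1.
arms = [[math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)] for k in range(8)]
theta, pulls = lin_ucb_2d(arms, theta_star=arms[1])
```

With unit-length arms, the noiseless reward is the cosine of the angle between the pulled arm and θ*, so every pull is informative about arms that were never tried.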


Exploiting Arm Context

[Figure: the set of arms x1, x2, … plotted on two axes, meat to vegetarian and spicy to mild: Mince pie, Buffalo wings, Tofu scramble, Grilled vegetables, Ratatouille, Tandoori Chicken, Jalapeno scramble, Pad Thai, Penne Arrabiata. The optimal arm θ* is marked.]


Exploiting Arm Context

[Figure: arms, the optimal arm, and the next arm chosen (Buffalo wings), at angle θ from the optimal arm]

Reward (= cos(θ)) is small, but we can still infer information about other arms!


Exploiting Arm Context

[Figure: arms, the optimal arm, the next arm chosen, the estimate of the optimal arm θ1, and the region of uncertainty C1]



We’ve already homed in on a pretty good choice

[Figure: arms, the optimal arm, and the next arm chosen, x2]




And the process continues …

[Figure: arms, the optimal arm, the next arm chosen, the estimate θ2, and the region of uncertainty C2]



Some Caveats

• Big assumption that we know good features.

• Finding features takes a lot of work.

• Few arms, many people → learn an embedding of arms

• Few people, many arms → Featurize, linear bandits

• Linear models are a naive assumption, see kernel methods.



Industry Review


Companies using MAB


Headlines, Photos and Ads

Washington Post Google


Washington Post

Used Upper Confidence Bound (UCB) to pick headlines and photos


Google Experiments

Used Thompson Sampling (TS)

Updated models twice a day

Two metrics used to gauge end of experiment:

• 95% confidence that an alternate is better, or …

• "potential value remaining in the experiment"

Takeaway Message: The more arms, the higher the gain over A/B testing.


Advanced Topics


Topics

• Biasing

• Data Joining and Latency

• Non-stationary


Bias

             Website 1      Website 2
Probability  50%   50%      90%   10%
Number sold  100   20       100   20

Who did better?


Bias

• Be careful when using past data!

• Inverse Propensity Score Matching

• New sales estimates:

Website 1: 100*0.5 + 20*0.5 = 60

Website 2: 100*0.5*(0.5/0.9) + 20*0.5*(0.5/0.1) ≈ 77.8
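The two estimates can be evaluated with a short sketch that follows the expressions as written on the slide: each count is multiplied by the target probability and by the ratio of target to logged probability. The function name `reweighted_sales` and its parameter names are hypothetical.

```python
def reweighted_sales(sold, logged_probs, target_probs):
    """Reweight each count from the logged probability to the target probability."""
    return sum(n * p_target * (p_target / p_logged)
               for n, p_logged, p_target in zip(sold, logged_probs, target_probs))


sold = [100, 20]

# Website 1 already showed both options 50/50, so the weights are all 1.
website_1 = reweighted_sales(sold, logged_probs=[0.5, 0.5], target_probs=[0.5, 0.5])

# Website 2 showed the options 90/10, so the rarely shown option is weighted up.
website_2 = reweighted_sales(sold, logged_probs=[0.9, 0.1], target_probs=[0.5, 0.5])

print(website_1, round(website_2, 1))
```

The option shown only 10% of the time gets a weight of 5, which is why small logged probabilities make such estimates noisy in practice.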


Data Joining and Latency

[Figure: the context and decision are logged first; the rewards arrive after some latency. Courtesy: Microsoft MWT white paper]


Non-Stationarity – Beer example

My yearly beer taste:

January: Stouts and porters

April: Pale Ales and IPAs

July: Wits and Lagers

October: Oktoberfests and Reds

December: Christmas Ales


Non-Stationarity

Preferences change over time.

There may be periodicity in the data; tax season is a great example.

Some solutions:

• Slow changes → System with finite memory

• Abrupt changes → Subspace tracking/anomaly detection
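One simple reading of "system with finite memory" is a sliding window over the most recent rewards, sketched below. The class name `SlidingWindowMean` and the window size are assumptions; a discounted average would be another option.

```python
from collections import deque


class SlidingWindowMean:
    """Average of only the most recent `window` rewards for one arm."""

    def __init__(self, window):
        self.rewards = deque(maxlen=window)

    def update(self, reward):
        self.rewards.append(reward)  # the oldest reward drops out automatically

    def mean(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0


# A preference that shifts from $10 tips to $2 tips.
estimate = SlidingWindowMean(window=3)
for tip in [10, 10, 10, 2, 2, 2]:
    estimate.update(tip)
print(estimate.mean())  # 2.0
```

A short window reacts quickly to change but gives a noisier estimate, so the window size trades responsiveness against variance.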

Takeaway Message: Preferences change over time, biases are added, and data needs to be joined from different sources.


Thank You. Questions?
