Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Adith Swaminathan, Thorsten Joachims
Software: http://www.cs.cornell.edu/~adith/poem/


Page 1: Counterfactual Risk Minimization


Page 2: Learning frameworks

(input 𝑥 ↦ prediction 𝑦)

                   Online          Batch
Full Information   Perceptron, …   SVM, …
Bandit Feedback    LinUCB, …       ?

Page 3: Logged bandit feedback is everywhere!

Alice: "Want tech news" → system shows "SpaceX launch" → Alice: "Cool!"

𝑥ᵢ: query    𝑦ᵢ: prediction    δᵢ: feedback

𝑥₁ = (Alice, tech), 𝑦₁ = (SpaceX), δ₁ = 1
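As a concrete data structure, each logged interaction is just a (context, action, feedback) record; a minimal sketch (the `LogEntry` class and its field names are illustrative, not part of the POEM software):

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    # One interaction logged by the deployed system h0.
    x: tuple      # context, e.g. ("Alice", "tech")
    y: str        # action shown by h0, e.g. "SpaceX"
    delta: float  # feedback observed for this (x, y) only

log = [
    LogEntry(("Alice", "tech"), "SpaceX", 1.0),
    LogEntry(("Alice", "sports"), "F1", 5.0),
]
print(log[0].delta)  # 1.0
```

Note that the feedback δ is recorded only for the action actually shown; feedback for every other 𝑦 stays unobserved, which is what makes this bandit feedback.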

Page 4: Goal

• Risk of ℎ: 𝕏 ↦ 𝕐:   𝑅(ℎ) = 𝔼ₓ[δ(𝑥, ℎ(𝑥))]
• Find ℎ* ∈ 𝓗 with minimum risk
• Can we find ℎ* using 𝒟 = {(𝑥ᵢ, 𝑦ᵢ, δᵢ)} collected from ℎ₀?

  𝑥₁ = (Alice, tech), 𝑦₁ = (SpaceX), δ₁ = 1
  𝑥₂ = (Alice, sports), 𝑦₂ = (F1), δ₂ = 5

Page 5: Learning by replaying logs?

• Training/evaluation from logged data is counterfactual [Bottou et al]
• 𝑥 ∼ Pr(𝕏), however the logged action is 𝑦 = ℎ₀(𝑥), not ℎ(𝑥)

  Logged: 𝑥₁ = (Alice, tech), 𝑦₁ = (SpaceX), δ₁ = 1;  𝑥₂ = (Alice, sports), 𝑦₂ = (F1), δ₂ = 5
  A new ℎ might instead predict 𝑦₁′ = Apple Watch for 𝑥₁ = (Alice, tech). What would Alice do?

Page 6: Stochastic policies to the rescue!

• Stochastic ℎ: 𝕏 ↦ Δ(𝕐), 𝑦 ∼ ℎ(𝑥)
• 𝑅(ℎ) = 𝔼ₓ 𝔼_{𝑦∼ℎ(𝑥)}[δ(𝑥, 𝑦)]

[Figure: distributions ℎ₀ and ℎ over 𝕐, marking likely vs. unlikely predictions such as "SpaceX launch".]

Page 7: Counterfactual risk estimators

• 𝒟 = {(𝑥₁, 𝑦₁, δ₁, 𝑝₁), (𝑥₂, 𝑦₂, δ₂, 𝑝₂), …, (𝑥ₙ, 𝑦ₙ, δₙ, 𝑝ₙ)}
• 𝑝ᵢ = ℎ₀(𝑦ᵢ|𝑥ᵢ) … the propensity [Rosenbaum et al]
• Basic importance sampling [Owen]:

  𝔼ₓ 𝔼_{𝑦∼ℎ}[δ(𝑥, 𝑦)] = 𝔼ₓ 𝔼_{𝑦∼ℎ₀}[δ(𝑥, 𝑦) · ℎ(𝑦|𝑥)/ℎ₀(𝑦|𝑥)]
  (performance of new system = samples from old system × importance weight)

• Estimator:  𝑅̂_𝒟(ℎ) = (1/𝑛) Σᵢ₌₁ⁿ δᵢ ℎ(𝑦ᵢ|𝑥ᵢ)/𝑝ᵢ
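The estimator is a one-liner over the logged quadruples; a minimal sketch in Python (function and variable names are hypothetical, not the POEM API):

```python
def ips_risk(log, h_prob):
    """Importance-sampling (IPS) estimate of R(h) from logged bandit data.

    log    : list of (x, y, delta, p) tuples logged under h0, where
             p = h0(y|x) is the logged propensity.
    h_prob : function returning h(y|x) for the new policy being evaluated.
    """
    return sum(d * h_prob(x, y) / p for (x, y, d, p) in log) / len(log)

# Toy log: h0 played each logged action with propensity 0.5.
log = [("x1", "a", 2.0, 0.5), ("x2", "b", 4.0, 0.5)]
copy_h0 = lambda x, y: 0.5     # new policy identical to h0
print(ips_risk(log, copy_h0))  # 3.0 -- reduces to the plain average loss
```

When ℎ equals ℎ₀ the importance weights are all 1 and the estimate is just the average logged loss; the weights only kick in when the new policy disagrees with the logger.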

Page 8: Story so far

Logged bandit data → Fix ℎ vs. ℎ₀ mismatch → Control variance → Tractable bound

Page 9: Importance sampling causes non-uniform variance!

𝑥₁ = (Alice, sports), 𝑦₁ = (F1), δ₁ = 5, 𝑝₁ = 0.9
𝑥₂ = (Alice, tech), 𝑦₂ = (SpaceX), δ₂ = 1, 𝑝₂ = 0.9
𝑥₃ = (Alice, movies), 𝑦₃ = (Star Wars), δ₃ = 2, 𝑝₃ = 0.9
𝑥₄ = (Alice, tech), 𝑦₄ = (Tesla), δ₄ = 1, 𝑝₄ = 0.9

ℎ₁: 𝑅̂_𝒟(ℎ₁) = 1      ℎ₂: 𝑅̂_𝒟(ℎ₂) = 1.33

Want: an error bound that captures the variance of importance sampling
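To see the non-uniform variance concretely: two policies can have comparable empirical risks while the importance-weighted losses they average have very different spread. A sketch using the four logged examples above (the two policies themselves are made up for illustration):

```python
import statistics

def weighted_losses(log, h_prob):
    # u_i = delta_i * h(y_i|x_i) / p_i: the terms the IPS estimator averages.
    return [d * h_prob(x, y) / p for (x, y, d, p) in log]

# The four logged examples above, each with propensity 0.9 under h0.
log = [(("Alice", "sports"), "F1", 5.0, 0.9),
       (("Alice", "tech"), "SpaceX", 1.0, 0.9),
       (("Alice", "movies"), "Star Wars", 2.0, 0.9),
       (("Alice", "tech"), "Tesla", 1.0, 0.9)]

h_match = lambda x, y: 0.9                        # mimics h0 on every logged action
h_spiky = lambda x, y: 1.0 if y == "F1" else 0.0  # all mass on one logged action

for h in (h_match, h_spiky):
    u = weighted_losses(log, h)
    print(round(statistics.mean(u), 2), round(statistics.variance(u), 2))
```

The "spiky" policy gets a lower mean but a far larger sample variance, because a single importance weight carries its whole estimate; that is exactly the effect the bound on the next slide has to account for.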

Page 10: Counterfactual Risk Minimization

• W.h.p. over 𝒟 ∼ ℎ₀, for all ℎ ∈ 𝓗:

  𝑅(ℎ) ≤ 𝑅̂_𝒟(ℎ) + 𝑂(√(Var_𝒟(ℎ)/𝑛)) + 𝑂(√(𝒞(𝓗)/𝑛))
  (empirical risk + variance regularization + capacity control)
  *conditions apply; refer to [Maurer et al]

• Learning objective:

  ℎ_CRM = argmin_{ℎ∈𝓗} { 𝑅̂_𝒟(ℎ) + λ √(Var_𝒟(ℎ)/𝑛) }
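Over a finite candidate set, the objective amounts to ranking policies by their variance-penalized empirical risk; a minimal sketch (λ and the candidate policies are invented for illustration):

```python
import math
import statistics

def crm_objective(log, h_prob, lam=0.5):
    """Variance-penalized estimate: R_hat(h) + lam * sqrt(Var(h)/n)."""
    u = [d * h_prob(x, y) / p for (x, y, d, p) in log]
    return statistics.mean(u) + lam * math.sqrt(statistics.variance(u) / len(u))

# Toy log under a uniform h0 (propensity 0.5 for each logged action).
log = [("q", "a", 1.0, 0.5), ("q", "b", 3.0, 0.5)] * 2
candidates = {"copy_h0": lambda x, y: 0.5,
              "prefer_a": lambda x, y: 0.8 if y == "a" else 0.2}
best = min(candidates, key=lambda name: crm_objective(log, candidates[name]))
print(best)  # prefer_a -- lower empirical risk AND lower variance
```

In POEM the argmin runs over a continuous parameter space rather than a finite dictionary, but the penalized criterion being minimized is the same shape.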

Page 11: POEM: CRM algorithm for structured prediction

• CRFs: ℎ_𝑤 ∈ 𝓗_lin;  ℎ_𝑤(𝑦|𝑥) = exp(𝑤·ϕ(𝑥, 𝑦)) / ℤ(𝑥; 𝑤)
• Policy Optimizer for Exponential Models:

  𝑤* = argmin_𝑤 { (1/𝑛) Σᵢ₌₁ⁿ δᵢ ℎ_𝑤(𝑦ᵢ|𝑥ᵢ)/𝑝ᵢ + λ √(Var(ℎ_𝑤)/𝑛) + μ ‖𝑤‖² }

• Good: gradient descent; searches over infinitely many 𝑤
• Bad: not convex in 𝑤
• Ugly: resists stochastic optimization
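The exponential-model parameterization itself is just a softmax over candidate outputs; a small sketch (the feature map ϕ and the candidate set are illustrative stand-ins):

```python
import math

def h_w(w, x, candidates, phi):
    """Exponential-model policy h_w(y|x) = exp(w . phi(x, y)) / Z(x; w)."""
    scores = {y: math.exp(sum(wi * fi for wi, fi in zip(w, phi(x, y))))
              for y in candidates}
    Z = sum(scores.values())                 # partition function Z(x; w)
    return {y: s / Z for y, s in scores.items()}

# Two candidate outputs; one binary feature that fires on "a", plus a bias.
phi = lambda x, y: [1.0 if y == "a" else 0.0, 1.0]
probs = h_w([math.log(3.0), 0.0], "some context", ["a", "b"], phi)
print(round(probs["a"], 2))  # 0.75
```

For structured prediction the candidate set 𝕐 is exponentially large and ℤ(𝑥; 𝑤) is computed with dynamic programming rather than by enumeration; this sketch enumerates only to keep the parameterization visible.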

Page 12: Stochastically optimize √Var(ℎ_𝑤)?

• Taylor-approximate! With coefficients 𝐴_{𝑤ₜ}, 𝐵_{𝑤ₜ}, 𝐶_{𝑤ₜ} computed at the current iterate 𝑤ₜ:

  Var(ℎ_𝑤) ≤ 𝐴_{𝑤ₜ} Σᵢ₌₁ⁿ ℎ_𝑤ⁱ + 𝐵_{𝑤ₜ} Σᵢ₌₁ⁿ (ℎ_𝑤ⁱ)² + 𝐶_{𝑤ₜ}

• During an epoch: Adagrad with per-sample gradients ∇ℎ_𝑤ⁱ + λ√𝑛 (𝐴_{𝑤ₜ} ∇ℎ_𝑤ⁱ + 2𝐵_{𝑤ₜ} ℎ_𝑤ⁱ ∇ℎ_𝑤ⁱ)
• After the epoch: 𝑤ₜ₊₁ ↤ 𝑤; recompute 𝐴_{𝑤ₜ₊₁}, 𝐵_{𝑤ₜ₊₁}
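One way such a bound arises: the empirical variance is an affine function of the per-sample sums Σᵢ ℎ_𝑤ⁱ and Σᵢ (ℎ_𝑤ⁱ)², and the square root is concave, so its first-order Taylor expansion at the current iterate is a global upper bound that decomposes over samples. A minimal sketch of that tangent bound (the function name is mine, not POEM's):

```python
import math

def sqrt_tangent_bound(v, v0):
    """First-order Taylor expansion of sqrt at v0. Because sqrt is concave,
    the tangent upper-bounds it: sqrt(v) <= sqrt(v0) + (v - v0)/(2*sqrt(v0))."""
    return math.sqrt(v0) + (v - v0) / (2.0 * math.sqrt(v0))

# The majorizer touches sqrt at v0 and lies above it everywhere else.
for v in (0.25, 0.5, 1.0, 2.0, 4.0):
    assert sqrt_tangent_bound(v, v0=1.0) >= math.sqrt(v)
print("tangent majorizes sqrt")
```

Minimizing the tangent bound instead of √Var itself is what makes each epoch amenable to per-sample (Adagrad-style) updates, with the coefficients refreshed once per epoch.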

Page 13: Experiment

• Supervised → Bandit conversion for multi-label data [Agarwal et al]
• δ(𝑥, 𝑦) = Hamming(𝑦*(𝑥), 𝑦) (smaller is better)
• LibSVM datasets:
  • Scene (few features, labels, and data)
  • Yeast (many labels)
  • LYRL (many features and data)
  • TMC (many features, labels, and data)
• Validate hyper-parameters (λ, μ) using 𝑅̂_{𝒟_val}(ℎ)
• Report expected Hamming loss on the supervised test set
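The Supervised → Bandit conversion can be sketched as sampling an action from the logging policy for each supervised example and recording only its loss and propensity (function names and the toy ℎ₀ below are invented for illustration):

```python
import random

def bandit_log_from_supervised(data, h0_sample, loss, seed=0):
    """For each supervised (x, y_star), sample an action y from the logging
    policy h0 and record only loss(y_star, y) plus the propensity h0(y|x)."""
    rng = random.Random(seed)
    log = []
    for x, y_star in data:
        y, p = h0_sample(rng, x)  # sampled action and its probability under h0
        log.append((x, y, loss(y_star, y), p))
    return log

# Multi-label Hamming loss between the true and sampled label vectors.
hamming = lambda y_star, y: sum(a != b for a, b in zip(y_star, y))
# Toy h0: uniform over two candidate labelings, so every propensity is 0.5.
h0 = lambda rng, x: (rng.choice([(0, 1), (1, 1)]), 0.5)
log = bandit_log_from_supervised([("doc1", (1, 1)), ("doc2", (0, 1))], h0, hamming)
print(len(log))  # 2
```

The full labels 𝑦*(𝑥) are then held back for evaluation only, which is what lets a supervised test set score policies trained purely from the bandit log.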

Page 14: Approaches

• Baseline
  • ℎ₀: supervised CRF trained on 5% of the training data
• Proposed
  • IPS (no variance penalty; extends [Bottou et al])
  • POEM
• Skyline
  • Supervised CRF (independent logit regression per label)

Page 15: (1) Does variance regularization help?

Test set expected Hamming loss δ (smaller is better):

                 Scene   Yeast   LYRL    TMC
ℎ₀               1.543   5.547   1.463   3.445
IPS              1.519   4.614   1.118   3.023
POEM             1.143   4.517   0.996   2.522
Supervised CRF   0.659   2.822   0.222   1.189

Page 16: (2) Is it efficient?

Avg time (s)   Scene   Yeast   LYRL     TMC
POEM(B)        75.20   94.16   561.12   949.95
POEM(S)         4.71    5.02   120.09   276.13
CRF             4.86    3.28    62.93    99.18

• POEM recovers the same performance at a fraction of the L-BFGS cost
• Scales like a supervised CRF while learning from bandit feedback

Page 17: (3) Does generalization improve as 𝑛 → ∞?

[Plot: test-set expected Hamming loss δ vs. replay count (1–256) for CRF, ℎ₀, and POEM.]

Page 18: (4) Does stochasticity of ℎ₀ affect learning?

[Plot: test-set expected Hamming loss δ vs. parameter multiplier (0.5–32) for ℎ₀ and POEM, sweeping ℎ₀ from stochastic to deterministic.]

Page 19: Conclusion

• CRM principle for learning from logged bandit feedback
  • Variance regularization
• POEM for structured output prediction
  • Scales like a supervised CRF, learns from bandit feedback

• Contact: [email protected]
• POEM available at http://www.cs.cornell.edu/~adith/poem/
• Long paper: Counterfactual Risk Minimization: Learning from Logged Bandit Feedback, http://jmlr.org/proceedings/papers/v37/swaminathan15.html
• Thanks!

Page 20: References

1. Art B. Owen. 2013. Monte Carlo Theory, Methods and Examples.

2. Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70. 41-55.

3. LΓ©on Bottou, Jonas Peters, Joaquin QuiΓ±onero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. J. Mach. Learn. Res. 14, 1, 3207-3260.

4. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li and Robert Schapire. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Proceedings of the 31st International Conference on Machine Learning. 1638-1646.

5. Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample-variance penalization. Proceedings of the 22nd Conference on Learning Theory.

6. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. Proceedings of the 32nd International Conference on Machine Learning.
