Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Adith Swaminathan, Thorsten Joachims
Software: http://www.cs.cornell.edu/~adith/poem/


Page 1: Counterfactual Risk Minimization


Page 2: Learning frameworks

(input 𝑥 ↦ prediction 𝑦)

                   Online          Batch
Full Information   Perceptron, …   SVM, …
Bandit Feedback    LinUCB, …       ?

Page 3: Logged bandit feedback is everywhere!

Alice: "Want tech news" → system shows "SpaceX launch" → Alice: "Cool!"

𝑥ᵢ: query    𝑦ᵢ: prediction    δᵢ: feedback

𝑥₁ = (Alice, tech), 𝑦₁ = (SpaceX), δ₁ = 1
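As a concrete data structure, each logged interaction is just a (context, action, feedback) record; a minimal sketch (the `LogEntry` class and its field names are illustrative, not part of the POEM software):

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    # One interaction logged by the deployed system h0.
    x: tuple      # context, e.g. ("Alice", "tech")
    y: str        # action shown by h0, e.g. "SpaceX"
    delta: float  # feedback observed for this (x, y) only

log = [
    LogEntry(("Alice", "tech"), "SpaceX", 1.0),
    LogEntry(("Alice", "sports"), "F1", 5.0),
]
print(log[0].delta)  # 1.0
```

Note that the feedback δ is recorded only for the action actually shown; feedback for every other 𝑦 stays unobserved, which is what makes this bandit feedback.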

Page 4: Goal

• Risk of ℎ: 𝕏 ↦ 𝕐:   𝑅(ℎ) = 𝔼ₓ[δ(𝑥, ℎ(𝑥))]
• Find ℎ* ∈ 𝓗 with minimum risk
• Can we find ℎ* using 𝒟 = {(𝑥ᵢ, 𝑦ᵢ, δᵢ)} collected from ℎ₀?

  𝑥₁ = (Alice, tech), 𝑦₁ = (SpaceX), δ₁ = 1
  𝑥₂ = (Alice, sports), 𝑦₂ = (F1), δ₂ = 5

Page 5: Learning by replaying logs?

• Training/evaluation from logged data is counterfactual [Bottou et al]
• 𝑥 ∼ Pr(𝕏), however the logged action is 𝑦 = ℎ₀(𝑥), not ℎ(𝑥)

  Logged: 𝑥₁ = (Alice, tech), 𝑦₁ = (SpaceX), δ₁ = 1;  𝑥₂ = (Alice, sports), 𝑦₂ = (F1), δ₂ = 5
  A new ℎ might instead predict 𝑦₁′ = Apple Watch for 𝑥₁ = (Alice, tech). What would Alice do?

Page 6: Stochastic policies to the rescue!

• Stochastic ℎ: 𝕏 ↦ Δ(𝕐), 𝑦 ∼ ℎ(𝑥)
• 𝑅(ℎ) = 𝔼ₓ 𝔼_{𝑦∼ℎ(𝑥)}[δ(𝑥, 𝑦)]

[Figure: distributions ℎ₀ and ℎ over 𝕐, marking likely vs. unlikely predictions such as "SpaceX launch".]

Page 7: Counterfactual risk estimators

• 𝒟 = {(𝑥₁, 𝑦₁, δ₁, 𝑝₁), (𝑥₂, 𝑦₂, δ₂, 𝑝₂), …, (𝑥ₙ, 𝑦ₙ, δₙ, 𝑝ₙ)}
• 𝑝ᵢ = ℎ₀(𝑦ᵢ|𝑥ᵢ) … the propensity [Rosenbaum et al]
• Basic importance sampling [Owen]:

  𝔼ₓ 𝔼_{𝑦∼ℎ}[δ(𝑥, 𝑦)] = 𝔼ₓ 𝔼_{𝑦∼ℎ₀}[δ(𝑥, 𝑦) · ℎ(𝑦|𝑥)/ℎ₀(𝑦|𝑥)]
  (performance of new system = samples from old system × importance weight)

• Estimator:  𝑅̂_𝒟(ℎ) = (1/𝑛) Σᵢ₌₁ⁿ δᵢ ℎ(𝑦ᵢ|𝑥ᵢ)/𝑝ᵢ
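The estimator is a one-liner over the logged quadruples; a minimal sketch in Python (function and variable names are hypothetical, not the POEM API):

```python
def ips_risk(log, h_prob):
    """Importance-sampling (IPS) estimate of R(h) from logged bandit data.

    log    : list of (x, y, delta, p) tuples logged under h0, where
             p = h0(y|x) is the logged propensity.
    h_prob : function returning h(y|x) for the new policy being evaluated.
    """
    return sum(d * h_prob(x, y) / p for (x, y, d, p) in log) / len(log)

# Toy log: h0 played each logged action with propensity 0.5.
log = [("x1", "a", 2.0, 0.5), ("x2", "b", 4.0, 0.5)]
copy_h0 = lambda x, y: 0.5     # new policy identical to h0
print(ips_risk(log, copy_h0))  # 3.0 -- reduces to the plain average loss
```

When ℎ equals ℎ₀ the importance weights are all 1 and the estimate is just the average logged loss; the weights only kick in when the new policy disagrees with the logger.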

Page 8: Story so far

Logged bandit data → Fix ℎ vs. ℎ₀ mismatch → Control variance → Tractable bound

Page 9: Importance sampling causes non-uniform variance!

𝑥₁ = (Alice, sports), 𝑦₁ = (F1), δ₁ = 5, 𝑝₁ = 0.9
𝑥₂ = (Alice, tech), 𝑦₂ = (SpaceX), δ₂ = 1, 𝑝₂ = 0.9
𝑥₃ = (Alice, movies), 𝑦₃ = (Star Wars), δ₃ = 2, 𝑝₃ = 0.9
𝑥₄ = (Alice, tech), 𝑦₄ = (Tesla), δ₄ = 1, 𝑝₄ = 0.9

ℎ₁: 𝑅̂_𝒟(ℎ₁) = 1      ℎ₂: 𝑅̂_𝒟(ℎ₂) = 1.33

Want: an error bound that captures the variance of importance sampling
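To see the non-uniform variance concretely: two policies can have comparable empirical risks while the importance-weighted losses they average have very different spread. A sketch using the four logged examples above (the two policies themselves are made up for illustration):

```python
import statistics

def weighted_losses(log, h_prob):
    # u_i = delta_i * h(y_i|x_i) / p_i: the terms the IPS estimator averages.
    return [d * h_prob(x, y) / p for (x, y, d, p) in log]

# The four logged examples above, each with propensity 0.9 under h0.
log = [(("Alice", "sports"), "F1", 5.0, 0.9),
       (("Alice", "tech"), "SpaceX", 1.0, 0.9),
       (("Alice", "movies"), "Star Wars", 2.0, 0.9),
       (("Alice", "tech"), "Tesla", 1.0, 0.9)]

h_match = lambda x, y: 0.9                        # mimics h0 on every logged action
h_spiky = lambda x, y: 1.0 if y == "F1" else 0.0  # all mass on one logged action

for h in (h_match, h_spiky):
    u = weighted_losses(log, h)
    print(round(statistics.mean(u), 2), round(statistics.variance(u), 2))
```

The "spiky" policy gets a lower mean but a far larger sample variance, because a single importance weight carries its whole estimate; that is exactly the effect the bound on the next slide has to account for.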

Page 10: Counterfactual Risk Minimization

• W.h.p. over 𝒟 ∼ ℎ₀, for all ℎ ∈ 𝓗:

  𝑅(ℎ) ≤ 𝑅̂_𝒟(ℎ) + 𝑂(√(Var_𝒟(ℎ)/𝑛)) + 𝑂(√(𝒞(𝓗)/𝑛))
  (empirical risk + variance regularization + capacity control)
  *conditions apply; refer to [Maurer et al]

• Learning objective:

  ℎ_CRM = argmin_{ℎ∈𝓗} { 𝑅̂_𝒟(ℎ) + λ √(Var_𝒟(ℎ)/𝑛) }
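Over a finite candidate set, the objective amounts to ranking policies by their variance-penalized empirical risk; a minimal sketch (λ and the candidate policies are invented for illustration):

```python
import math
import statistics

def crm_objective(log, h_prob, lam=0.5):
    """Variance-penalized estimate: R_hat(h) + lam * sqrt(Var(h)/n)."""
    u = [d * h_prob(x, y) / p for (x, y, d, p) in log]
    return statistics.mean(u) + lam * math.sqrt(statistics.variance(u) / len(u))

# Toy log under a uniform h0 (propensity 0.5 for each logged action).
log = [("q", "a", 1.0, 0.5), ("q", "b", 3.0, 0.5)] * 2
candidates = {"copy_h0": lambda x, y: 0.5,
              "prefer_a": lambda x, y: 0.8 if y == "a" else 0.2}
best = min(candidates, key=lambda name: crm_objective(log, candidates[name]))
print(best)  # prefer_a -- lower empirical risk AND lower variance
```

In POEM the argmin runs over a continuous parameter space rather than a finite dictionary, but the penalized criterion being minimized is the same shape.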

Page 11: POEM: CRM algorithm for structured prediction

• CRFs: ℎ_𝑤 ∈ 𝓗_lin;  ℎ_𝑤(𝑦|𝑥) = exp(𝑤·ϕ(𝑥, 𝑦)) / ℤ(𝑥; 𝑤)
• Policy Optimizer for Exponential Models:

  𝑤* = argmin_𝑤 { (1/𝑛) Σᵢ₌₁ⁿ δᵢ ℎ_𝑤(𝑦ᵢ|𝑥ᵢ)/𝑝ᵢ + λ √(Var(ℎ_𝑤)/𝑛) + μ ‖𝑤‖² }

• Good: gradient descent; searches over infinitely many 𝑤
• Bad: not convex in 𝑤
• Ugly: resists stochastic optimization
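The exponential-model parameterization itself is just a softmax over candidate outputs; a small sketch (the feature map ϕ and the candidate set are illustrative stand-ins):

```python
import math

def h_w(w, x, candidates, phi):
    """Exponential-model policy h_w(y|x) = exp(w . phi(x, y)) / Z(x; w)."""
    scores = {y: math.exp(sum(wi * fi for wi, fi in zip(w, phi(x, y))))
              for y in candidates}
    Z = sum(scores.values())                 # partition function Z(x; w)
    return {y: s / Z for y, s in scores.items()}

# Two candidate outputs; one binary feature that fires on "a", plus a bias.
phi = lambda x, y: [1.0 if y == "a" else 0.0, 1.0]
probs = h_w([math.log(3.0), 0.0], "some context", ["a", "b"], phi)
print(round(probs["a"], 2))  # 0.75
```

For structured prediction the candidate set 𝕐 is exponentially large and ℤ(𝑥; 𝑤) is computed with dynamic programming rather than by enumeration; this sketch enumerates only to keep the parameterization visible.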

Page 12: Stochastically optimize √Var(ℎ_𝑤)?

• Taylor-approximate! With coefficients 𝐴_{𝑤ₜ}, 𝐵_{𝑤ₜ}, 𝐶_{𝑤ₜ} computed at the current iterate 𝑤ₜ:

  Var(ℎ_𝑤) ≤ 𝐴_{𝑤ₜ} Σᵢ₌₁ⁿ ℎ_𝑤ⁱ + 𝐵_{𝑤ₜ} Σᵢ₌₁ⁿ (ℎ_𝑤ⁱ)² + 𝐶_{𝑤ₜ}

• During an epoch: Adagrad with per-sample gradients ∇ℎ_𝑤ⁱ + λ√𝑛 (𝐴_{𝑤ₜ} ∇ℎ_𝑤ⁱ + 2𝐵_{𝑤ₜ} ℎ_𝑤ⁱ ∇ℎ_𝑤ⁱ)
• After the epoch: 𝑤ₜ₊₁ ↤ 𝑤; recompute 𝐴_{𝑤ₜ₊₁}, 𝐵_{𝑤ₜ₊₁}
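One way such a bound arises: the empirical variance is an affine function of the per-sample sums Σᵢ ℎ_𝑤ⁱ and Σᵢ (ℎ_𝑤ⁱ)², and the square root is concave, so its first-order Taylor expansion at the current iterate is a global upper bound that decomposes over samples. A minimal sketch of that tangent bound (the function name is mine, not POEM's):

```python
import math

def sqrt_tangent_bound(v, v0):
    """First-order Taylor expansion of sqrt at v0. Because sqrt is concave,
    the tangent upper-bounds it: sqrt(v) <= sqrt(v0) + (v - v0)/(2*sqrt(v0))."""
    return math.sqrt(v0) + (v - v0) / (2.0 * math.sqrt(v0))

# The majorizer touches sqrt at v0 and lies above it everywhere else.
for v in (0.25, 0.5, 1.0, 2.0, 4.0):
    assert sqrt_tangent_bound(v, v0=1.0) >= math.sqrt(v)
print("tangent majorizes sqrt")
```

Minimizing the tangent bound instead of √Var itself is what makes each epoch amenable to per-sample (Adagrad-style) updates, with the coefficients refreshed once per epoch.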

Page 13: Experiment

• Supervised → Bandit conversion for multi-label data [Agarwal et al]
• δ(𝑥, 𝑦) = Hamming(𝑦*(𝑥), 𝑦) (smaller is better)
• LibSVM datasets:
  • Scene (few features, labels, and data)
  • Yeast (many labels)
  • LYRL (many features and data)
  • TMC (many features, labels, and data)
• Validate hyper-parameters (λ, μ) using 𝑅̂_{𝒟_val}(ℎ)
• Report expected Hamming loss on the supervised test set
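The Supervised → Bandit conversion can be sketched as sampling an action from the logging policy for each supervised example and recording only its loss and propensity (function names and the toy ℎ₀ below are invented for illustration):

```python
import random

def bandit_log_from_supervised(data, h0_sample, loss, seed=0):
    """For each supervised (x, y_star), sample an action y from the logging
    policy h0 and record only loss(y_star, y) plus the propensity h0(y|x)."""
    rng = random.Random(seed)
    log = []
    for x, y_star in data:
        y, p = h0_sample(rng, x)  # sampled action and its probability under h0
        log.append((x, y, loss(y_star, y), p))
    return log

# Multi-label Hamming loss between the true and sampled label vectors.
hamming = lambda y_star, y: sum(a != b for a, b in zip(y_star, y))
# Toy h0: uniform over two candidate labelings, so every propensity is 0.5.
h0 = lambda rng, x: (rng.choice([(0, 1), (1, 1)]), 0.5)
log = bandit_log_from_supervised([("doc1", (1, 1)), ("doc2", (0, 1))], h0, hamming)
print(len(log))  # 2
```

The full labels 𝑦*(𝑥) are then held back for evaluation only, which is what lets a supervised test set score policies trained purely from the bandit log.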

Page 14: Approaches

• Baseline
  • ℎ₀: supervised CRF trained on 5% of the training data
• Proposed
  • IPS (no variance penalty; extends [Bottou et al])
  • POEM
• Skyline
  • Supervised CRF (independent logit regression per label)

Page 15: (1) Does variance regularization help?

Test set expected Hamming loss δ (smaller is better):

                 Scene   Yeast   LYRL    TMC
ℎ₀               1.543   5.547   1.463   3.445
IPS              1.519   4.614   1.118   3.023
POEM             1.143   4.517   0.996   2.522
Supervised CRF   0.659   2.822   0.222   1.189

Page 16: (2) Is it efficient?

Avg time (s)   Scene   Yeast   LYRL     TMC
POEM(B)        75.20   94.16   561.12   949.95
POEM(S)         4.71    5.02   120.09   276.13
CRF             4.86    3.28    62.93    99.18

• POEM recovers the same performance at a fraction of the L-BFGS cost
• Scales like a supervised CRF while learning from bandit feedback

Page 17: (3) Does generalization improve as 𝑛 → ∞?

[Plot: test-set expected Hamming loss δ vs. replay count (1–256) for CRF, ℎ₀, and POEM.]

Page 18: (4) Does stochasticity of ℎ₀ affect learning?

[Plot: test-set expected Hamming loss δ vs. parameter multiplier (0.5–32) for ℎ₀ and POEM, sweeping ℎ₀ from stochastic to deterministic.]

Page 19: Conclusion

• CRM principle for learning from logged bandit feedback
  • Variance regularization
• POEM for structured output prediction
  • Scales like a supervised CRF, learns from bandit feedback

• Contact: [email protected]
• POEM available at http://www.cs.cornell.edu/~adith/poem/
• Long paper: Counterfactual Risk Minimization: Learning from Logged Bandit Feedback, http://jmlr.org/proceedings/papers/v37/swaminathan15.html
• Thanks!

Page 20: References

1. Art B. Owen. 2013. Monte Carlo Theory, Methods and Examples.

2. Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70. 41-55.

3. LΓ©on Bottou, Jonas Peters, Joaquin QuiΓ±onero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. J. Mach. Learn. Res. 14, 1, 3207-3260.

4. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li and Robert Schapire. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Proceedings of the 31st International Conference on Machine Learning. 1638-1646.

5. Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample-variance penalization. Proceedings of the 22nd Conference on Learning Theory.

6. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. Proceedings of the 32nd International Conference on Machine Learning.
