Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Adith Swaminathan, Thorsten Joachims
Software: http://www.cs.cornell.edu/~adith/poem/
Learning frameworks

                     Online            Batch
Full Information     Perceptron, ...   SVM, ...
Bandit Feedback      LinUCB, ...       ?
Logged bandit feedback is everywhere!

Example interaction: Alice wants tech news, the system shows "SpaceX launch", Alice responds "Cool!"

x_i: Query        x_1 = (Alice, tech)
y_i: Prediction   y_1 = (SpaceX)
δ_i: Feedback     δ_1 = 1
Goal

• Risk of h: X → Y:   R(h) = E_x[ δ(x, h(x)) ]
• Find h* ∈ H with minimum risk
• Can we find h* using data D collected from h0?

  x_1 = (Alice, tech),   y_1 = (SpaceX),  δ_1 = 1
  x_2 = (Alice, sports), y_2 = (F1),      δ_2 = 5
Learning by replaying logs?

• Training/evaluation from logged data is counterfactual [Bottou et al]

  Logged: x_1 = (Alice, tech),   y_1 = (SpaceX), δ_1 = 1
          x_2 = (Alice, sports), y_2 = (F1),     δ_2 = 5

  Replay: x_1 = (Alice, tech), new prediction y_1' = Apple watch
          What would Alice do??
Stochastic policies to the rescue!

• Problem: x ~ Pr(X), but the logged y = h0(x), not h(x)
• Stochastic h: y ~ h(x);   R(h) = E_x E_{y~h(x)}[ δ(x, y) ]

[Figure: distributions h0 and h over outputs for the "SpaceX launch" example, with outputs marked Likely / Unlikely under each policy.]
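To make the double expectation concrete, here is a minimal Python sketch (not from the talk): a softmax policy h(·|x) over a small action set and a Monte Carlo estimate of R(h). The feature map, action set, and loss δ are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, dim = 4, 3

def softmax_policy(w, x):
    """h(.|x): distribution over actions for context x (toy linear scores)."""
    scores = x @ w                          # shape (num_actions,)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

def risk(w, delta, num_contexts=10000):
    """Monte Carlo estimate of R(h) = E_x E_{y~h(x)} delta(x, y)."""
    total = 0.0
    for _ in range(num_contexts):
        x = rng.normal(size=(num_actions, dim))  # one feature vector per action
        p = softmax_policy(w, x)
        y = rng.choice(num_actions, p=p)         # sample prediction y ~ h(x)
        total += delta(x, y)
    return total / num_contexts

# Toy loss: action 0 is always best (delta = 0), every other action costs 1.
toy_delta = lambda x, y: 0.0 if y == 0 else 1.0
w = rng.normal(size=dim)
print("estimated risk:", risk(w, toy_delta))
```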
Counterfactual risk estimators
• D = { (x_1, y_1, δ_1, p_1), (x_2, y_2, δ_2, p_2), ..., (x_n, y_n, δ_n, p_n) }
• p_i = h0(y_i | x_i) ... the propensity [Rosenbaum et al]

  R̂(h) = (1/n) Σ_{i=1}^{n} δ_i · h(y_i | x_i) / p_i

Basic importance sampling [Owen]:
  E_x E_{y~h}[ δ(x, y) ]  =  E_x E_{y~h0}[ δ(x, y) · h(y|x) / h0(y|x) ]
  Perf of new system          Samples from old system × importance weight
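A minimal numpy sketch of this estimator, assuming the per-example arrays below (feedback, logged propensities, and the new policy's probabilities for the logged predictions) are given; the numbers are purely illustrative.

```python
import numpy as np

def ips_risk_estimate(delta, prop, new_prob):
    """R_hat(h) = (1/n) * sum_i delta_i * h(y_i|x_i) / p_i from logged bandit data."""
    weights = new_prob / prop              # importance weights h(y_i|x_i) / h0(y_i|x_i)
    return float(np.mean(delta * weights))

# Illustrative log: feedback delta_i, logging propensities p_i = h0(y_i|x_i),
# and the new policy's probabilities h(y_i|x_i) for the logged predictions.
delta    = np.array([1.0, 5.0, 2.0])
prop     = np.array([0.9, 0.9, 0.9])
new_prob = np.array([0.5, 0.1, 0.3])
print(ips_risk_estimate(delta, prop, new_prob))
```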
Story so far

Logged bandit data → Fix h vs. h0 mismatch → Control variance → Tractable bound
Importance sampling causes non-uniform variance!
  x_1 = (Alice, sports), y_1 = (F1),        δ_1 = 5, p_1 = 0.9
  x_2 = (Alice, tech),   y_2 = (SpaceX),    δ_2 = 1, p_2 = 0.9
  x_3 = (Alice, movies), y_3 = (star wars), δ_3 = 2, p_3 = 0.9
  x_4 = (Alice, tech),   y_4 = (Tesla),     δ_4 = 1, p_4 = 0.9

  h_1:  R̂(h_1) = 1          h_2:  R̂(h_2) = 1.33
Want: an error bound that captures the variance of importance sampling, together with capacity control.
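A small sketch of why the variance is non-uniform: the same log evaluated under two hypothetical policies (their probabilities for the logged predictions are invented here) can give similar estimates R̂ but very different sample variances.

```python
import numpy as np

# Logged data from the slide: feedback delta_i and propensities p_i = h0(y_i|x_i).
delta = np.array([5.0, 1.0, 2.0, 1.0])
prop  = np.array([0.9, 0.9, 0.9, 0.9])

def ips_terms(new_prob):
    """Per-example terms delta_i * h(y_i|x_i) / p_i; their mean is R_hat(h)."""
    return delta * new_prob / prop

# Hypothetical policies: h1 spreads probability evenly over the logged
# predictions, h2 concentrates its mass on a single one.
h1 = np.array([0.18, 0.18, 0.18, 0.18])
h2 = np.array([0.0,  0.0,  0.81, 0.0])

for name, probs in [("h1", h1), ("h2", h2)]:
    z = ips_terms(probs)
    print(name, "R_hat =", z.mean(), " sample variance =", z.var(ddof=1))
```

Both hypothetical policies get the same estimate R̂ = 0.45 here, but the concentrated one has a much larger sample variance, which is exactly what the error bound below needs to account for.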
Counterfactual Risk Minimization
• W.h.p. over D ~ h0, for all h ∈ H:

    R(h)  ≤  R̂(h)  +  O( √( Var(h) / n ) )  +  O( √( C(H) / n ) )
             empirical risk   variance regularization

  *conditions apply; refer to [Maurer et al]
Learning objective
  h_CRM = argmin_{h ∈ H}  R̂(h) + λ √( Var(h) / n )
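A minimal sketch of this objective, assuming the per-example importance-weighted losses z_i = δ_i h(y_i|x_i)/p_i have already been computed (e.g. as in the IPS sketch above); λ is the regularization hyper-parameter.

```python
import numpy as np

def crm_objective(z, lam):
    """CRM objective: empirical risk plus a variance penalty.

    z   -- per-example importance-weighted losses z_i = delta_i * h(y_i|x_i) / p_i
    lam -- regularization strength lambda
    """
    n = len(z)
    r_hat = np.mean(z)                 # empirical (counterfactual) risk R_hat(h)
    var_hat = np.var(z, ddof=1)        # sample variance Var_hat(h)
    return r_hat + lam * np.sqrt(var_hat / n)

z = np.array([1.0, 0.2, 0.4, 0.2])     # e.g. the per-example terms from above
print(crm_objective(z, lam=0.5))
```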
POEM: CRM algorithm for structured prediction
• Exponential models (CRFs): h_w ∈ H_lin,   h_w(y|x) = exp( w · φ(x, y) ) / Z(x; w)

• Policy Optimizer for Exponential Models:

    w* = argmin_w  (1/n) Σ_{i=1}^{n} δ_i h_w(y_i|x_i) / p_i  +  λ √( Var(h_w) / n )  +  μ ‖w‖²
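A sketch of this objective for a tiny exponential-model policy; the joint feature map φ, the candidate output set, and the logged records are toy assumptions, not the structured CRF used in the talk.

```python
import numpy as np

def policy_prob(w, x, y, candidates, phi):
    """h_w(y|x) = exp(w . phi(x, y)) / Z(x; w) over a small candidate set of outputs."""
    scores = np.array([w @ phi(x, c) for c in candidates])
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[candidates.index(y)]

def poem_objective(w, logs, candidates, phi, lam, mu):
    """(1/n) sum_i delta_i h_w(y_i|x_i)/p_i + lam*sqrt(Var/n) + mu*||w||^2."""
    z = np.array([d * policy_prob(w, x, y, candidates, phi) / p
                  for (x, y, d, p) in logs])
    n = len(z)
    return z.mean() + lam * np.sqrt(z.var(ddof=1) / n) + mu * (w @ w)

# Toy usage: 2 candidate outputs, made-up joint features phi(x, y).
rng = np.random.default_rng(0)
phi = lambda x, y: np.concatenate([x, [float(y)]])
logs = [(rng.normal(size=2), 1, 0.8, 0.5), (rng.normal(size=2), 0, 2.0, 0.5)]
w = rng.normal(size=3)
print(poem_objective(w, logs, candidates=[0, 1], phi=phi, lam=0.5, mu=1e-3))
```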
Good: gradient descent, search over infinitely many w
Bad:  not convex in w
Ugly: resists stochastic optimization
Stochastically optimize √Var(h_w)?

• Taylor-approximate!

    √Var(h_w)  ≤  A_{w_t} Σ_{i=1}^{n} h_w^i  +  B_{w_t} Σ_{i=1}^{n} (h_w^i)²  +  C_{w_t}

  (h_w^i is shorthand for the i-th importance-weighted term δ_i h_w(y_i|x_i)/p_i)

• During an epoch: AdaGrad steps with per-example gradient  ∇h_w^i + λ ( A_{w_t} ∇h_w^i + 2 B_{w_t} h_w^i ∇h_w^i )
• After an epoch: set w_{t+1} ← w and recompute A_{w_{t+1}}, B_{w_{t+1}}
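A minimal sketch (not the POEM code) of this epoch-wise scheme: the square root of the sample variance is majorized by a first-order Taylor expansion at the current iterate w_t, giving per-example gradients that AdaGrad can consume. The data, the sigmoid stand-in for h_w(y_i|x_i), the learning rate, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged data: features x_i, feedback delta_i, propensities p_i.
# The policy's probability of the logged prediction is modeled here by a
# sigmoid sigma(w . x_i), purely as a stand-in for h_w(y_i|x_i).
n, dim = 200, 5
X = rng.normal(size=(n, dim))
delta = rng.uniform(0.0, 2.0, size=n)
prop = np.full(n, 0.5)
lam = 0.5

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def per_example_loss_and_grad(w, i):
    """z_i = delta_i * sigma(w.x_i) / p_i and its gradient in w."""
    s = sigmoid(X[i] @ w)
    z = delta[i] * s / prop[i]
    grad = delta[i] / prop[i] * s * (1.0 - s) * X[i]
    return z, grad

def majorization_coeffs(z):
    """First-order Taylor upper bound of sqrt(Var(z)) = A*sum(z) + B*sum(z^2) + const.

    The constant does not affect gradients, so only A and B are returned.
    """
    s1, s2 = z.sum(), (z ** 2).sum()
    f0 = np.sqrt(max(s2 / n - (s1 / n) ** 2, 1e-12))
    A = -s1 / (n ** 2 * f0)
    B = 1.0 / (2 * n * f0)
    return A, B

w = np.zeros(dim)
g_accum = np.zeros(dim)                       # AdaGrad accumulator
for epoch in range(10):
    # Recompute A_{w_t}, B_{w_t} at the current iterate before each epoch.
    z_all = np.array([per_example_loss_and_grad(w, i)[0] for i in range(n)])
    A, B = majorization_coeffs(z_all)
    for i in rng.permutation(n):
        z, gz = per_example_loss_and_grad(w, i)
        # Gradient of the decomposable surrogate:
        #   (1/n) grad z_i + (lam/sqrt(n)) * (A grad z_i + 2 B z_i grad z_i)
        g = gz / n + (lam / np.sqrt(n)) * (A * gz + 2 * B * z * gz)
        g_accum += g ** 2
        w -= 0.1 * g / (np.sqrt(g_accum) + 1e-8)

z_final = np.array([per_example_loss_and_grad(w, i)[0] for i in range(n)])
print("final objective:", z_final.mean() + lam * np.sqrt(z_final.var() / n))
```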
Experiment
• Supervised → Bandit multi-label conversion [Agarwal et al] (see the sketch after this list)
• δ(x, y) = Hamming(y*(x), y)   (smaller is better)
• LibSVM datasets:
  • Scene (few features, labels and data)
  • Yeast (many labels)
  • LYRL (many features and data)
  • TMC (many features, labels and data)
• Validate hyper-params (λ, μ) using R̂(h) on a validation split of the logs
• Report: supervised test-set expected Hamming loss
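A minimal sketch of the supervised → bandit conversion described in the first bullet: sample a prediction from a logging policy h0, record the Hamming loss against the true labels and the propensity. The label-wise Bernoulli logger here is a made-up stand-in for the CRF h0 used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_bandit_feedback(X, Y_star, h0_prob):
    """Turn a supervised multi-label set (X, Y_star) into logged bandit feedback.

    h0_prob(x) -> per-label probabilities of predicting 1 under the logger h0.
    For each x we sample a label vector y ~ h0(x), record the Hamming loss
    against the true labels y*, and the propensity p = h0(y | x).
    """
    logs = []
    for x, y_star in zip(X, Y_star):
        q = h0_prob(x)                                    # per-label Bernoulli params
        y = (rng.random(len(q)) < q).astype(int)          # sampled prediction y ~ h0(x)
        delta = int(np.sum(y != y_star))                  # Hamming(y*(x), y), smaller is better
        p = float(np.prod(np.where(y == 1, q, 1.0 - q)))  # propensity h0(y | x)
        logs.append((x, y, delta, p))
    return logs

# Toy supervised data: 4 examples, 3 labels; a made-up logging policy.
X = rng.normal(size=(4, 5))
Y_star = rng.integers(0, 2, size=(4, 3))
h0_prob = lambda x: np.full(3, 0.5)                       # uniform logger (illustrative)
for x, y, delta, p in log_bandit_feedback(X, Y_star, h0_prob):
    print(y, delta, round(p, 3))
```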
Approaches
• Baselines
  • h0: supervised CRF trained on 5% of the training data
• Proposed
  • IPS (no variance penalty; extends [Bottou et al])
  • POEM
• Skylines
  • Supervised CRF (independent logistic regression)
(1) Does variance regularization help?
Test set expected Hamming loss (δ, smaller is better):

                  Scene   Yeast   LYRL    TMC
  h0              1.543   5.547   1.463   3.445
  IPS             1.519   4.614   1.118   3.023
  POEM            1.143   4.517   0.996   2.522
  Supervised CRF  0.659   2.822   0.222   1.189
(2) Is it efficient?
Avg Time (s) Scene Yeast LYRL TMC
POEM(B) 75.20 94.16 561.12 949.95
POEM(S) 4.71 5.02 120.09 276.13
CRF 4.86 3.28 62.93 99.18
• POEM(S), the stochastic variant, recovers the same performance as POEM(B) at a fraction of the batch L-BFGS cost
• POEM scales like a supervised CRF while learning from bandit feedback
(3) Does generalization improve as n → ∞?
[Figure: test-set expected Hamming loss (δ) vs. replay count (1, 2, 4, ..., 256) for h0, CRF, and POEM.]
(4) Does stochasticity of h0 affect learning?
[Figure: test-set expected Hamming loss (δ) vs. param multiplier (0.5, 1, 2, ..., 32), sweeping h0 from stochastic to deterministic, for h0 and POEM.]
Conclusion
• CRM principle to learn from logged bandit feedback
  • Variance regularization
• POEM for structured output prediction
  • Scales like a supervised CRF, learns from bandit feedback
• Contact: adith@cs.cornell.edu
• POEM available at http://www.cs.cornell.edu/~adith/poem/
• Long paper: Counterfactual risk minimization: Learning from logged bandit feedback, http://jmlr.org/proceedings/papers/v37/swaminathan15.html
• Thanks!
References

1. Art B. Owen. 2013. Monte Carlo theory, methods and examples.
2. Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70. 41-55.
3. LΓ©on Bottou, Jonas Peters, Joaquin QuiΓ±onero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. J. Mach. Learn. Res. 14, 1, 3207-3260.
4. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li and Robert Schapire. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Proceedings of the 31st International Conference on Machine Learning. 1638-1646.
5. Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample-variance penalization. Proceedings of the 22nd Conference on Learning Theory.
6. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. Proceedings of the 32nd International Conference on Machine Learning.