Sequential Off-line Learning with Knowledge Gradients
Peter Frazier, Warren Powell, Savas Dayanik
Department of Operations Research and Financial Engineering, Princeton University
Overview
• Problem Formulation
• Knowledge Gradient Policy
• Theoretical Results
• Numerical Results
Measurement Phase
[Figure: N opportunities to do experiments; measuring one of alternatives 1, ..., M yields a noisy experimental outcome (e.g. +1, +.2, −.2, −1).]
Implementation Phase
[Figure: after the measurement phase, one of alternatives 1, ..., M is chosen and its reward is collected.]
Taxonomy of Sampling Problems

                  Reward Structure
Sampling          On-line    Off-line
Sequential                   X
Batch

The X marks the setting studied here: sequential sampling with off-line rewards.
Example 1
• One common experimental design is to spread measurements equally across the alternatives.
x^n = (n mod M) + 1

[Figure: quality estimates plotted by alternative — axes "Alternative" vs. "Quality".]
Example 1
Round-Robin Exploration: x^n = (n mod M) + 1
Example 2
• How might we improve round-robin exploration for use with this prior?

x^n = argmax_x σ^n_x
Example 2
Largest-Variance Exploration: x^n = argmax_x σ^n_x
Example 3
Exploitation: x^n = argmax_x μ^n_x
Model
• x^n is the alternative tested at time n.
• Measuring alternative x^n yields y^{n+1} = Y_{x^n} + ε^{n+1}.
• At time n, Y_x ∼ N(μ^n_x, (σ^n_x)²), independent across alternatives.
• The error ε^{n+1} is independent N(0, (σ^ε)²).
• At time N, choose an alternative to maximize E[Y_{x^N}] = E[max_x μ^N_x].
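This measurement model can be sketched in a few lines; `draw_truths` and `measure` are hypothetical helper names, and the prior parameters used in testing are arbitrary:

```python
import random

def draw_truths(rng, mu0, sigma2_0):
    # the true values Y_x are drawn once from the independent normal priors
    return [rng.gauss(m, s2 ** 0.5) for m, s2 in zip(mu0, sigma2_0)]

def measure(rng, truths, x, sigma_eps2):
    # measuring alternative x returns y = Y_x + eps, with eps ~ N(0, (sigma_eps)^2)
    return truths[x] + rng.gauss(0.0, sigma_eps2 ** 0.5)
```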
State Transition
• At time n we measure alternative x^n and update our estimate of Y_{x^n} based on the measurement y^{n+1}:

μ^{n+1}_x = [(σ^n_x)^{−2} μ^n_x + (σ^ε)^{−2} y^{n+1}] / [(σ^n_x)^{−2} + (σ^ε)^{−2}]   for x = x^n
(σ^{n+1}_x)^{−2} = (σ^n_x)^{−2} + (σ^ε)^{−2}   for x = x^n

• Estimates of the other alternatives do not change:

μ^{n+1}_x = μ^n_x,  σ^{n+1}_x = σ^n_x   for x ≠ x^n
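A minimal sketch of this update in precision (inverse-variance) form; `kg_update` is a hypothetical name, and it assumes (σ^n_x)² > 0 for the measured alternative:

```python
def kg_update(mu, sigma2, x, y, sigma_eps2):
    # Bayesian update for the measured alternative x; all others are unchanged
    mu, sigma2 = list(mu), list(sigma2)
    prec = 1.0 / sigma2[x] + 1.0 / sigma_eps2         # posterior precision
    mu[x] = (mu[x] / sigma2[x] + y / sigma_eps2) / prec
    sigma2[x] = 1.0 / prec
    return mu, sigma2
```

With equal prior and noise variance, one observation moves the mean halfway toward the observed value and halves the variance.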
• At time n, μ^{n+1}_x is a normal random variable with mean μ^n_x and variance (σ̃^{n+1}_x)² satisfying

(σ^{n+1}_x)² = (σ^n_x)² − (σ̃^{n+1}_x)²

where (σ^n_x)² is the uncertainty about Y_x before the measurement, (σ^{n+1}_x)² is the uncertainty about Y_x after the measurement, and (σ̃^{n+1}_x)² is the variance of the change in the best estimate of Y_x due to the measurement.

• The value of the optimal policy satisfies Bellman's equation:

V^N(μ^N, σ^N) = max_x μ^N_x
V^n(μ^n, σ^n) = max_{x^n} E[V^{n+1}(μ^{n+1}, σ^{n+1})]
Utility of Information
Consider our "utility of information"

U(μ^n, σ^n) := max_x μ^n_x

and the random change in utility due to a measurement at time n,

ΔU^{n+1} := U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)
Knowledge Gradient Definition
• The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility,
x^n = argmax_x E_n[ΔU^{n+1}]
Knowledge Gradient
• We may compute the knowledge gradient policy via
E_n[ΔU^{n+1}] := E_n[U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)]
              = E_n[(max_x μ^{n+1}_x) − (max_x μ^n_x)]
              = E_n[max(μ^{n+1}_{x^n}, max_{x≠x^n} μ^n_x) − max_x μ^n_x],

which is the expectation of the maximum of a normal random variable and a constant.
Knowledge Gradient
The computation becomes

x^n = argmax_x σ̃(σ^n_x) f(ζ^n_x)

where Φ is the normal cdf, φ is the normal pdf, and

σ̃(s) := s² / √(s² + (σ^ε)²)
ζ^n_x := −|μ^n_x − max_{x'≠x} μ^n_{x'}| / σ̃(σ^n_x)
f(z) := z·Φ(z) + φ(z)
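This index can be computed with only the standard library; `kg_scores` is a hypothetical name, and the perfectly-known case σ̃ = 0 is handled by assigning a zero score:

```python
import math

def kg_scores(mu, sigma2, sigma_eps2):
    # per-alternative knowledge-gradient score sigma_tilde(sigma_x) * f(zeta_x)
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # normal pdf
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # normal cdf
    f = lambda z: z * Phi(z) + phi(z)
    scores = []
    for x in range(len(mu)):
        st = sigma2[x] / math.sqrt(sigma2[x] + sigma_eps2)  # sigma_tilde(sigma_x)
        if st == 0.0:
            scores.append(0.0)  # a perfectly known alternative has zero knowledge gradient
            continue
        delta = abs(mu[x] - max(mu[i] for i in range(len(mu)) if i != x))
        scores.append(st * f(-delta / st))
    return scores
```

On the deck's Numerical Example 1 (all μ_x = 0, σ² = (4, 2, 1), σ^ε = 1) this yields scores ≈ (0.714, 0.461, 0.282), so the policy measures alternative 1.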
Optimality Results
• If our measurement budget allows only one measurement (N=1), the knowledge gradient policy is optimal.
Optimality Results
• The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity.
• This is really a convergence result.
Optimality Results
• The knowledge gradient policy has sub-optimality bounded by

V^n(μ^n, σ^n) − V^{KG,n}(μ^n, σ^n) ≤ (2π)^{−1/2} (N − n − 1) max_x σ̃(σ^n_x)

where V^{KG,n} gives the value of the knowledge gradient policy and V^n the value of the optimal policy.
Optimality Results
• If there are exactly 2 alternatives (M=2), the knowledge gradient policy is optimal.
• In this case, the optimal policy reduces to x^n = argmax_x σ^n_x.
Optimality Results
• If there is no measurement noise and the alternatives may be reordered so that

μ^0_1 ≥ μ^0_2 ≥ ... ≥ μ^0_M  and  σ^0_1 ≥ σ^0_2 ≥ ... ≥ σ^0_M,

then the knowledge gradient policy is optimal.
Numerical Experiments
• 100 randomly generated problems
• M ∼ Uniform{1, ..., 100}
• N ∼ Uniform{M, 3M, 10M}
• μ^0_x ∼ Uniform[−1, 1]
• (σ^0_x)² = 1 with probability 0.9, 10⁻³ with probability 0.1
• σ^ε = 1
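The problem generator above can be sketched as follows (`random_problem` is a hypothetical name; the uniform choices follow the list exactly):

```python
import random

def random_problem(rng):
    # draw one random test problem as in the numerical experiments
    M = rng.randint(1, 100)                     # M ~ Uniform{1, ..., 100}
    N = rng.choice([M, 3 * M, 10 * M])          # N ~ Uniform{M, 3M, 10M}
    mu0 = [rng.uniform(-1.0, 1.0) for _ in range(M)]
    sigma2_0 = [1.0 if rng.random() < 0.9 else 1e-3 for _ in range(M)]
    sigma_eps = 1.0
    return M, N, mu0, sigma2_0, sigma_eps
```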
Numerical Experiments
Interval Estimation

• Compare alternatives via a linear combination of mean and standard deviation:

x^n = argmax_x μ^n_x + z_{α/2} σ^n_x

• The parameter z_{α/2} controls the tradeoff between exploration and exploitation.
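The IE rule is a one-liner; `ie_choice` is a hypothetical helper name, with ties broken toward the lowest index:

```python
def ie_choice(mu, sigma, z):
    # interval estimation: measure the alternative with the largest mu_x + z * sigma_x
    scores = [m + z * s for m, s in zip(mu, sigma)]
    return scores.index(max(scores))
```

A large z favors high-variance (exploratory) alternatives; a small z favors high-mean (exploitative) ones.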
KG / IE Comparison
KG / IE Comparison
[Figure: Value of KG − Value of IE across test problems.]
IE and “Sticking”
μ = [2.5, 0, ..., 0],  σ = [0, 1, ..., 1]

Alternative 1 is known perfectly.

x^n = argmax_x μ^n_x + z_{α/2} σ^n_x,  with z_{α/2} = 2.5
IE and “Sticking”
V^{IE} = 2.5, while V = E[max(2.5, Z_1, ..., Z_M)] ↗ ∞ as M ↗ ∞, so IE's shortfall from sticking on the known alternative grows without bound.
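A Monte Carlo check of this limit (sample size and seed are arbitrary): the achievable value E[max(2.5, Z_1, ..., Z_M)] with Z_x ∼ N(0, 1) grows with M, while IE stays at 2.5:

```python
import random

def achievable_value(M, trials=20000, seed=0):
    # Monte Carlo estimate of E[max(2.5, Z_1, ..., Z_M)], Z_x ~ N(0, 1)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max([2.5] + [rng.gauss(0.0, 1.0) for _ in range(M)])
    return total / trials
```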
Thank You
Any Questions?
Numerical Example 1
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 1

with σ̃_x = σ²_x / √(σ²_x + (σ^ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, ζ_x = −Δ_x / σ̃_x:

x   μ_x   σ²_x   σ̃_x     Δ_x   ζ_x   σ̃_x·f(ζ_x)
1   0     4      1.789    0     0     0.714
2   0     2      1.155    0     0     0.461
3   0     1      0.707    0     0     0.282
Numerical Example 2
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

with σ̃_x = σ²_x / √(σ²_x + (σ^ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, ζ_x = −Δ_x / σ̃_x:

x   μ_x   σ²_x   σ̃_x     Δ_x   ζ_x      σ̃_x·f(ζ_x)
1   1     0      0        1     −∞       0
2   0     1      0.707    1     −1.414   0.025
3   −1    1      0.707    2     −2.828   0.001
Numerical Example 3
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

with σ̃_x = σ²_x / √(σ²_x + (σ^ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, ζ_x = −Δ_x / σ̃_x:

x   μ_x   σ²_x   σ̃_x     Δ_x   ζ_x     σ̃_x·f(ζ_x)
1   1     1      0.707    1     −1.41   0.0251
2   0     4      1.789    1     −0.56   0.3223
3   −1    1      0.707    2     −2.83   0.0005
Knowledge Gradient Example
Interval Estimation Example
x^n = argmax_x μ^n_x + 3σ^n_x
Boltzmann Exploration
Parameterized by a declining sequence of temperatures (T^0, ..., T^{N−1}):

P_n{x^n = x} = exp(μ^n_x / T^n) / Σ_{x'=1}^{M} exp(μ^n_{x'} / T^n)
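These sampling probabilities are a temperature-scaled softmax over the posterior means; a minimal sketch (`boltzmann_probs` is a hypothetical name, shifted by max(μ) for numerical stability):

```python
import math

def boltzmann_probs(mu, T):
    # P{x^n = x} proportional to exp(mu_x / T); subtracting max(mu) avoids overflow
    m = max(mu)
    w = [math.exp((u - m) / T) for u in mu]
    s = sum(w)
    return [wi / s for wi in w]
```

As T^n declines toward 0 the distribution concentrates on the current best alternative; a large T^n gives near-uniform exploration.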