Limitation of Markov Models and Event-Based Learning
TRANSCRIPT
Page 1
Plenary Presentation at the
2008 Chinese Control and Decision Conference, July 2, 2008, Yantai, China

Limitation of Markov Models and Event-Based Learning & Optimization

Xi-Ren Cao
Hong Kong University of Science and Technology
Page 2
Table of Contents

0. Review: Optimization Problems (state-based policies)
1. Event-Based Formulation
   - Limitation of the state-based formulation
   - Events and event-based policies
   - Event-based formulation
2. Sensitivity-Based Approach to Optimization
   - A unified framework for optimization
   - Extensions to event-based optimization
3. Summary

Structure of the presentation: Overview of State-Based Optimization → Introduction to Event-Based Formulation → Sensitivity-Based Approach to State-Based Optimization → Solution to Event-Based Optimization
Page 3

A Typical Formulation of a Control Problem (Continuous-Time, Continuous-State Model)

LQG problem:

    dx/dt = Ax + Bu + w,    u = −Cx

    x: state;  u: control variable;  w: random noise

Performance measure:

    η = (1/T) E{ ∫₀ᵀ f[x(t), u(t)] dt }

Control u depends on state x; a policy u(x) is a map x → u. For the quadratic cost,

    η = (1/T) E{ ∫₀ᵀ (xᵀAx + uᵀBu) dt }
Page 4

Discrete-Time, Discrete-State Model (I): an example

A random walk of a robot over states 0, 1, 2, 3, 4. From state 0 the robot moves up with probability p and down with probability q, where p + q = 1; an up move reaches state 1 with probability α and state 2 with probability 1 − α, and a down move reaches state 3 with probability α and state 4 with probability 1 − α.

Reward function: f(0) = 0, f(1) = f(4) = 100, f(2) = f(3) = −100

Performance measure:

    η = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} f(X_t)
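The performance measure η can be estimated directly from a simulated sample path. A minimal sketch, assuming (as the sample path on page 6 suggests) that states 1-4 return deterministically to state 0; the function name and step count are illustrative choices, not from the slides:

```python
import random

def simulate(alpha, steps=200_000, p=0.5, seed=0):
    """Estimate eta = lim (1/T) sum f(X_t) for the random-walker chain.

    Modeling assumption (not stated on this slide): states 1-4 return
    deterministically to state 0.
    """
    f = {0: 0, 1: 100, 2: -100, 3: -100, 4: 100}
    rng = random.Random(seed)
    x, total = 0, 0
    for _ in range(steps):
        if x == 0:
            up = rng.random() < p            # up with prob p, down with prob q = 1 - p
            if up:
                x = 1 if rng.random() < alpha else 2
            else:
                x = 3 if rng.random() < alpha else 4
        else:
            x = 0                            # assumed return to the center state
        total += f[x]
    return total / steps

print(simulate(0.2), simulate(0.8))  # both near 0 when p = q = 1/2
```

With p = q = 1/2 the estimate comes out near 0 for any α, which is exactly the limitation discussed on the later slides.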
Page 5
Shannon Mouse (Theseus)
Page 6

Discrete Model (II): the dynamics

A sample path (the system dynamics) of the random walk: the state alternates between 0 and the outer states, e.g. X = 0, 3, 0, 1, 0, 2, 0, 3, …

[Figure: the sample path x versus t, and the random-walk diagram with probabilities p, q, α, 1 − α]
Page 7

Discrete Model (III): the Markov model

System dynamics:
- X = {Xₙ, n = 1, 2, …}, Xₙ ∈ S = {1, 2, …, M}
- Transition probability matrix P = [p(i,j)], i, j = 1, …, M

Steady-state probability: π = (π(1), π(2), …, π(M)), with

    π(I − P) = 0,  πe = 1    (I: identity matrix, e = (1, …, 1)ᵀ)

System performance:
- Reward function: f = (f(1), …, f(M))ᵀ
- Performance measure:

    η = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} f(X_t) = Σ_{i∈S} π(i) f(i) = πf

[Figures: a three-state chain with transition probabilities p(i,j), and the random walker]
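The steady-state equations π(I − P) = 0, πe = 1 and the average reward η = πf can be solved directly with linear algebra. A sketch for the random-walker chain, again under the assumption that states 1-4 return to 0; `steady_state` replaces one redundant balance equation with the normalization πe = 1:

```python
import numpy as np

def transition_matrix(alpha, p=0.5):
    """P for the 5-state walker; states 1-4 return to 0 (modeling assumption)."""
    q = 1 - p
    P = np.zeros((5, 5))
    P[0, 1], P[0, 2] = p * alpha, p * (1 - alpha)
    P[0, 3], P[0, 4] = q * alpha, q * (1 - alpha)
    P[1:, 0] = 1.0
    return P

def steady_state(P):
    """Solve pi (I - P) = 0 with pi e = 1 by replacing one equation."""
    M = P.shape[0]
    A = (np.eye(M) - P).T
    A[-1, :] = 1.0                      # normalization row: pi e = 1
    b = np.zeros(M); b[-1] = 1.0
    return np.linalg.solve(A, b)

f = np.array([0, 100, -100, -100, 100])
P = transition_matrix(alpha=0.7)
pi = steady_state(P)
eta = pi @ f                            # eta = pi f, the long-run average reward
```

For p = q = 1/2 the computed η is 0 regardless of α, matching the simulation estimate.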
Page 8

Control of Transition Probabilities

In the random walk, the robot turns on a red light with probability α and then moves to the left, or turns on a green light with probability 1 − α and then moves to the right.

[Figure: the random-walk diagram with probabilities p, q, α, 1 − α]
Page 9

Discrete Model (IV): Markov Decision Processes (MDPs), the control model

α: an action controls the transition probabilities
p_α(i,j): governs the system dynamics
α = d(x): a policy (state based), analogous to u = −Cx in the continuous-time model

System dynamics: the Markov model. Performance depends on the policy: π^d, η^d, etc., with

    η^d = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} f(X_t)

Goal of optimization: find a policy d that maximizes η^d over the policy space.
Page 10
Page 11

Limitation of State-Based Formulation (I)

The policy space is too large: with M = 100 states and N = 2 actions, there are Nᴹ = 2¹⁰⁰ ≈ 10³⁰ policies (a 10 GHz machine would need about 3 × 10¹² years just to count them!).

Special structures are not utilized, so the approach may not perform well.
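The order of magnitude on the slide is easy to check; the clock rate and counting model are the slide's illustrative assumptions:

```python
# Counting state-based policies: M states, N actions each -> N**M policies.
M, N = 100, 2
policies = N ** M                      # 2**100, about 1.27e30
clock_hz = 1e10                        # the slide's hypothetical 10 GHz counter
seconds_per_year = 365 * 24 * 3600
years = policies / clock_hz / seconds_per_year
print(f"{policies:.3e} policies, ~{years:.1e} years to enumerate")
```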
Page 12

Limitation of State-Based Formulation (II)

Example: the random walk of a robot. Choose α to maximize the average performance.

[Figure: the five-state random walk, rewards f(1) = f(4) = 100, f(2) = f(3) = −100, transition probabilities p, q, α, 1 − α]
Page 13

Limitation of State-Based Formulation (III)

Transition probabilities from state 0: αp to state 1, (1 − α)p to state 2, αq to state 3, (1 − α)q to state 4.

At state 0, if the robot moves up, α needs to be as large as possible; if it moves down, α needs to be as small as possible. A large α leads to a large reward at state 1 (100) but a small reward at state 3 (−100); a small α leads to a large reward at state 4 (100) but a small reward at state 2 (−100).

Let p = q = 1/2. Then the average performance in the next step is 0, no matter what α you choose (the best you can do with a state-based model).
Page 14

We can do better!

Group the two up transitions together as an event "a" and the two down transitions as an event "b". When "a" happens, choose the largest α; when "b" happens, choose the smallest α. Average performance = 100, if α = 1.

[Figure: events a (up, probability 1/2, α large) and b (down, probability 1/2, α small)]
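The event-based rule can be checked by simulation. A sketch under the same return-to-0 assumption as before: applying α = 1 when event a occurs and α = 0 when event b occurs earns +100 on every controlled transition (the slide's figure of 100); the time average over all steps, which also counts the visits to state 0 with f(0) = 0, then comes out to 50, strictly better than the 0 achievable by any state-based α:

```python
import random

f = {0: 0, 1: 100, 2: -100, 3: -100, 4: 100}

def simulate_event_based(steps=200_000, p=0.5, seed=1):
    """alpha = 1 when event a (up) occurs, alpha = 0 when event b (down) occurs."""
    rng = random.Random(seed)
    x, total = 0, 0
    for _ in range(steps):
        if x == 0:
            x = 1 if rng.random() < p else 4   # a -> state 1 (+100), b -> state 4 (+100)
        else:
            x = 0                              # assumed return to the center state
        total += f[x]
    return total / steps

eta_event = simulate_event_based()
print(eta_event)   # +100 on every second step, 0 on the visits to state 0
```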
Page 15

Events and Event-Based Policies

An event is defined as a set of state transitions. Event-based optimization:
- May lead to better performance than the state-based formulation
- The MDP model may not fit:
  - It controls only a part of the transitions
  - An event may consist of transitions from many states
- May reflect and utilize special structures

Questions:
- Why may it be better?
- How general is the formulation?
- How to solve event-based optimization problems?

[Figure: event a = the transitions αp, (1 − α)p out of state 0; event b = the transitions αq, (1 − α)q]
Page 16

Notation:
- A single transition ⟨i,j⟩, with i, j ∈ S = {1, 2, …, M}
- An event: a set of transitions (there are 2^(M²) such sets), e.g.
  a = {⟨0,1⟩, ⟨0,2⟩},  b = {⟨0,3⟩, ⟨0,4⟩}

Why is it better? An event contains information about the future (compared with state-based policies), and it has a physical interpretation.

[Figure: the random walk with events a and b marked]
Page 17

How general is the formulation? Example: admission control.

A queueing network: customers arrive at rate λ and are admitted with probability α(n), routed with probabilities q₀ᵢ, qᵢⱼ.
- n: population, the total number of customers in the network
- nᵢ: number of customers at server i; n = (n₁, …, n_M): the state
- N: network capacity

Event: a customer arrival finding population n. Action: accept or reject; it applies only when the event occurs. MDP does not apply: the same action is applied to different states with the same population.
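A minimal sketch of this setup, simplified (as an assumption, not from the slide) from a network to a single exponential queue: the admission probability α(n) is consulted only at arrival events that find population n, never between events. The rates, capacity, and threshold policy below are illustrative:

```python
import random

def simulate_admission(alpha, lam=1.0, mu=1.2, N=10, events=100_000, seed=2):
    """Event-based admission control on a single queue.

    alpha[n] = probability of accepting an arrival that finds population n;
    the action is applied only when the arrival event occurs.
    Returns (fraction of arrivals accepted, time-average population).
    """
    rng = random.Random(seed)
    n, t, area = 0, 0.0, 0.0
    accepted = arrivals = 0
    for _ in range(events):
        rate = lam + (mu if n > 0 else 0.0)
        dt = rng.expovariate(rate)
        area += n * dt
        t += dt
        if rng.random() < lam / rate:          # next event is an arrival
            arrivals += 1
            if n < N and rng.random() < alpha[n]:
                n += 1
                accepted += 1
        else:                                  # next event is a service completion
            n -= 1
    return accepted / arrivals, area / t

alpha = [1.0] * 5 + [0.0] * 5                  # accept only when fewer than 5 present
acc_frac, avg_n = simulate_admission(alpha)
```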
Page 18

Riemann Sampling vs. Lebesgue Sampling

RS: sample the system periodically, at times t₁, t₂, …, t_k, …

LS: sample the system whenever the signal reaches one of the prespecified levels d₁, d₂, …, and add control then.
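The two sampling schemes can be contrasted on a simple signal. A sketch in which the signal is a Gaussian random walk and the Lebesgue levels form a uniform grid of spacing `level`; both of those modeling choices are assumptions for illustration:

```python
import random

def sample_paths(T=10_000, dt=1.0, level=5.0, seed=3):
    """Riemann sampling: sample every k*dt.  Lebesgue sampling: sample only
    when the signal has moved by a full level since the last sample."""
    rng = random.Random(seed)
    x, last = 0.0, 0.0
    rs, ls = [], []
    for k in range(T):
        x += rng.gauss(0.0, 1.0)        # the underlying signal (a random walk)
        rs.append((k * dt, x))          # RS: every tick
        if abs(x - last) >= level:      # LS: only on level crossings
            ls.append((k * dt, x))
            last = x
    return rs, ls

rs, ls = sample_paths()
print(len(rs), len(ls))                 # LS takes far fewer samples
```

Lebesgue sampling acts only at the event "signal crosses a level", which is the event-based viewpoint in miniature.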
Page 19

A Model for Stock Prices or Financial Assets

    dX(t) = b(t, X(t)) dt + σ(t, X(t)) dw(t) + ∫ γ(t, X(t⁻), z) N(dt, dz)

w(t): Brownian motion; N(dt, dz): Poisson random measure
X(t): Itô–Lévy process

[Figure: a sample path X(t) crossing levels x̂ and x* at times τ₁*, τ₂*]
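An Euler discretization makes the model concrete. A sketch with constant coefficients and a fixed jump size, which are illustrative simplifications; the slide's model allows b, σ, and γ to depend on time and state:

```python
import random

def simulate_jump_diffusion(x0=100.0, b=0.05, sigma=0.2, lam=0.5,
                            jump=-10.0, T=1.0, n=1_000, seed=4):
    """Euler scheme for dX = b X dt + sigma X dw + jump dN.

    Constant coefficients and a fixed jump size are illustrative choices;
    dN is approximated by a Bernoulli increment with probability lam*dt.
    """
    rng = random.Random(seed)
    dt = T / n
    x, path = x0, [x0]
    for _ in range(n):
        dw = rng.gauss(0.0, dt ** 0.5)             # Brownian increment
        dn = 1 if rng.random() < lam * dt else 0   # Poisson increment
        x += b * x * dt + sigma * x * dw + jump * dn
        path.append(x)
    return path

path = simulate_jump_diffusion()
```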
Page 20

How to solve event-based optimization problems?
Page 21
An overview of the paths to the top of a hill
Page 22

A Sensitivity-Based View of Optimization

Continuous parameters, θ → θ + Δθ (perturbation analysis):

    dη/dδ = π Q g

Discrete policy space (policy iteration):

    η′ − η = π′ Q g

η: performance; π: steady-state probability; g: performance potentials; Q = P′ − P
Page 23

Poisson Equation

g(i) = potential contribution of state i (the potential, or bias)
     = contribution of the current state, f(i) − η, plus the expected long-term contribution after a transition:

    g(i) = f(i) − η + Σ_{j=1}^{M} p(i,j) g(j)

In matrix form (the Poisson equation):

    (I − P) g + ηe = f

The potential is relative: if g(i), i = 1, …, M, is a solution, so is g(i) + c for any constant c.

Physical interpretation:

    g(i) = E{ Σ_{l=0}^{∞} [f(X_l) − η] | X₀ = i }

e.g. g(4) ≈ the average of Σ_l f(X_l) over sample paths starting from X₀ = 4.
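The Poisson equation can be solved numerically. Since (I − P) is singular (g is only defined up to an additive constant), a standard trick is to solve (I − P + eπ)g = f instead, which picks the particular solution normalized by πg = η; the 3-state chain below is an illustrative example, not from the slides:

```python
import numpy as np

def potentials(P, f):
    """Solve the Poisson equation (I - P) g + eta e = f.

    (I - P) alone is singular, so we solve (I - P + e pi) g = f,
    which selects the solution with pi g = eta."""
    M = P.shape[0]
    # steady-state distribution: pi (I - P) = 0, pi e = 1
    A = (np.eye(M) - P).T
    A[-1, :] = 1.0
    b = np.zeros(M); b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    eta = pi @ f
    g = np.linalg.solve(np.eye(M) - P + np.outer(np.ones(M), pi), f)
    return g, eta, pi

# a small 3-state chain (illustrative numbers)
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
f = np.array([0.0, 100.0, -100.0])
g, eta, pi = potentials(P, f)
# g satisfies g(i) = f(i) - eta + sum_j p(i,j) g(j)
```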
Page 24

Two Sensitivity Formulas

For two Markov chains (P, η, π) and (P′, η′, π′), let Q = P′ − P.

Performance difference:

    η′ − η = π′ Q g = π′ (P′ − P) g

One-line simple derivation: multiply the Poisson equation (I − P) g + ηe = f on the left by π′.

Performance derivative, when P is a function of θ, P(θ):

    dη(θ)/dθ = π(θ) [dP(θ)/dθ] g(θ)

The derivative is the average change in expected potential at the next step. Perturbation analysis: choose the direction with the largest average change in expected potential at the next step.
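Both formulas can be verified numerically on a small chain; the matrices below are illustrative, and the parametrized family P(θ) = (1 − θ)P + θP′ is an assumed interpolation for checking the derivative against a finite difference:

```python
import numpy as np

def stat_dist(P):
    """pi (I - P) = 0, pi e = 1, via one replaced equation."""
    M = P.shape[0]
    A = (np.eye(M) - P).T
    A[-1, :] = 1.0
    b = np.zeros(M); b[-1] = 1.0
    return np.linalg.solve(A, b)

def avg_and_potentials(P, f):
    """eta = pi f and the potentials g from (I - P + e pi) g = f."""
    pi = stat_dist(P)
    g = np.linalg.solve(np.eye(len(f)) - P + np.outer(np.ones(len(f)), pi), f)
    return pi @ f, g, pi

f  = np.array([0.0, 100.0, -100.0])
P  = np.array([[0.0, 0.7, 0.3], [0.5, 0.0, 0.5], [0.4, 0.6, 0.0]])
Pp = np.array([[0.0, 0.9, 0.1], [0.3, 0.0, 0.7], [0.6, 0.4, 0.0]])

eta,  g,  pi  = avg_and_potentials(P,  f)
etap, gp, pip = avg_and_potentials(Pp, f)

# performance difference: eta' - eta = pi' (P' - P) g
diff_lhs = etap - eta
diff_rhs = pip @ (Pp - P) @ g

# performance derivative along P(theta) = (1 - theta) P + theta P'
def eta_of(theta):
    return avg_and_potentials((1 - theta) * P + theta * Pp, f)[0]

theta = 0.3
eta_t, g_t, pi_t = avg_and_potentials((1 - theta) * P + theta * Pp, f)
deriv = pi_t @ (Pp - P) @ g_t          # pi(theta) [dP/dtheta] g(theta)
fd = (eta_of(theta + 1e-6) - eta_of(theta - 1e-6)) / 2e-6
```

The difference formula holds exactly; the derivative formula agrees with the central finite difference to numerical precision.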
Page 25

Policy Iteration

    η′ − η = π′ Q g = π′ (P′ − P) g

1. η′ > η if P′g > Pg (using the fact that π′ > 0).
2. Policy iteration: at any state, find a policy P′ with P′g > Pg.
3. Reinforcement learning: stochastic approximation algorithms.

Policy iteration: choose the action with the largest change in expected potential at the next step.
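The improvement rule "find P′ with P′g > Pg" yields a complete policy-iteration loop. A sketch on a hypothetical 3-state, 2-action problem (transition rows and rewards are illustrative; the reward depends on the state only, as on the slide):

```python
import numpy as np

def stat_dist(P):
    M = P.shape[0]
    A = (np.eye(M) - P).T
    A[-1, :] = 1.0
    b = np.zeros(M); b[-1] = 1.0
    return np.linalg.solve(A, b)

def avg_and_potentials(P, f):
    pi = stat_dist(P)
    g = np.linalg.solve(np.eye(len(f)) - P + np.outer(np.ones(len(f)), pi), f)
    return pi @ f, g

# illustrative 3-state, 2-action problem: rows[i][a] is the transition
# row p_a(i, .) under action a
rows = [
    [np.array([0.1, 0.8, 0.1]), np.array([0.1, 0.1, 0.8])],
    [np.array([0.6, 0.2, 0.2]), np.array([0.2, 0.2, 0.6])],
    [np.array([0.5, 0.4, 0.1]), np.array([0.3, 0.3, 0.4])],
]
f = np.array([0.0, 100.0, -100.0])

policy, etas = [1, 1, 1], []
while True:
    P = np.array([rows[i][policy[i]] for i in range(3)])
    eta, g = avg_and_potentials(P, f)
    etas.append(eta)
    # improvement: switch an action only if it strictly increases the
    # expected potential at the next step (P'g > Pg implies eta' > eta);
    # ties keep the current action, so the loop terminates
    new = list(policy)
    for i in range(3):
        best = rows[i][policy[i]] @ g
        for a in range(2):
            if rows[i][a] @ g > best + 1e-9:
                new[i], best = a, rows[i][a] @ g
    if new == policy:
        break
    policy = new
```

Each iteration improves the long-run average reward monotonically, with no discounting anywhere in the loop.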
Page 26

Multi-Chain MDPs: Performance / Bias / Blackwell Optimization

D: policy space; D₀: performance-optimal policies; D₁: (1st) bias-optimal policies; D₂: 2nd bias-optimal policies; … ; D_M: Blackwell-optimal policies, nested as D ⊇ D₀ ⊇ D₁ ⊇ D₂ ⊇ … ⊇ D_M.

Bias measures transient behavior. With the performance-difference formulas, we can derive a simple, intuitive approach without discounting.
Page 27

A Map of the L&O World

Two policies P, P′ with Q = P′ − P; steady-state probabilities π, π′; long-run average performance η, η′; Poisson equation (I − P + eπ) g = f.

The two sensitivity formulas organize the field:

    dη/dδ = π Q g        (policy gradient: PA, gradient-based online optimization, SAC)
    η′ − η = π′ Q g      (policy iteration: MDPs, online policy iteration)

The potentials g can be estimated online by stochastic approximation: RL, TD(λ), Q-learning, neuro-dynamic programming, …

RL: reinforcement learning; PA: perturbation analysis; MDP: Markov decision process; SAC: stochastic adaptive control.
Page 28

Extension of the sensitivity-based approach to event-based optimization
Page 29

Two sensitivity formulas:
- Performance derivatives
- Performance differences

PA and PI:
- PA: choose the direction with the largest average change in expected potential at the next step
- PI: choose the action with the largest change in expected potential at the next step

Potentials are aggregated according to the event structure.
Page 30

Solution to the Random Walker Problem

Two policies: α_a = d(a), α_b = d(b) and α′_a = d′(a), α′_b = d′(b).

1. Performance difference:

    η′ − η = π′(a) [α′_a − α_a] g(a) + π′(b) [α′_b − α_b] g(b)

   with aggregated potentials g(a) = g(1) − g(2) and g(b) = g(3) − g(4); π′(a), π′(b) are the perturbed steady-state probabilities of events a and b.

2. Performance derivative, for α_a(θ), α_b(θ) continuous in θ:

    dη(θ)/dθ = π_θ(a) [dα_a(θ)/dθ] [g_θ(1) − g_θ(2)] + π_θ(b) [dα_b(θ)/dθ] [g_θ(3) − g_θ(4)]

Apparently, g(a) > 0 and g(b) < 0 for any policy. Policy iteration: at any iteration choose α′_a > α_a and α′_b < α_b. The optimal policy: α_a* is the largest and α_b* is the smallest. Choose the action with the largest change in expected potential at the next step; g(a) and g(b) are aggregated potentials.
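The claim g(a) > 0, g(b) < 0 can be checked by computing the potentials exactly for the walker chain, again under the assumed return-to-0 dynamics:

```python
import numpy as np

def walker_P(alpha, p=0.5):
    """Assumes states 1-4 return to state 0 (modeling assumption)."""
    q = 1 - p
    P = np.zeros((5, 5))
    P[0, 1:] = [p * alpha, p * (1 - alpha), q * alpha, q * (1 - alpha)]
    P[1:, 0] = 1.0
    return P

f = np.array([0.0, 100.0, -100.0, -100.0, 100.0])
P = walker_P(alpha=0.5)

# steady state: pi (I - P) = 0, pi e = 1
A = (np.eye(5) - P).T
A[-1, :] = 1.0
b = np.zeros(5); b[-1] = 1.0
pi = np.linalg.solve(A, b)
eta = pi @ f

# potentials from (I - P + e pi) g = f
g = np.linalg.solve(np.eye(5) - P + np.outer(np.ones(5), pi), f)

g_a = g[1] - g[2]   # aggregated potential of event a
g_b = g[3] - g[4]   # aggregated potential of event b
```

Since states 1 and 2 (and 3 and 4) have identical continuations here, g(a) = f(1) − f(2) = 200 > 0 and g(b) = f(3) − f(4) = −200 < 0, so the improvement direction never changes: push α_a up and α_b down.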
Page 31

Solution to the Admission Control Problem

Two policies: α(n) and α′(n).

1. Performance difference:

    η′ − η = Σ_{n=0}^{N−1} p′(n) [α′(n) − α(n)] d(n)

2. Performance derivative:

    dη/dδ = Σ_{n=0}^{N−1} p(n) [α′(n) − α(n)] d(n)

Here p(n) is the probability of an arrival finding n customers, and d(n) is the change in expected potentials between accepting and rejecting a customer, aggregated over all states (n₁, …, n_M) with population n₁ + … + n_M = n and over the routing probabilities q₀ᵢ of the admitted customer.

Policy iteration: choose α′(n) such that [α′(n) − α(n)] d(n) > 0; d(n) can be estimated on a sample path. Choose the action with the largest change in expected potential at the next step; d(n) is an aggregated potential.
Page 32

Sensitivity-Based Approaches to Event-Based Optimization

The same map applies, but with new sensitivity equations constructed from the event structure: in

    dη/dδ = π Q g    and    η′ − η = π′ Q g

the quantities π, Q, and g are replaced by their event-based counterparts (steady-state probabilities of events, event-conditioned transitions, and aggregated potentials). As before, the difference formula drives policy iteration (MDPs, online policy iteration) and the derivative formula drives policy gradients (PA, gradient-based online optimization, SAC), with the potentials estimated online by stochastic approximation (RL, TD(λ), Q-learning, neuro-dynamic programming, …).

RL: reinforcement learning; PA: perturbation analysis; MDP: Markov decision process; SAC: stochastic adaptive control.
Page 33
Summary
Page 34

Advantages of the Event-Based Approach

1. May achieve better performance.
2. The number of aggregated potentials d(n) is N, which may be linear in the system size.
3. Actions at different states are correlated, so standard MDPs do not apply.
4. Special features are captured by events; an action can depend on future information.
5. Opens up a new direction for many engineering problems:
   - POMDPs: the observation y as an event
   - Hierarchical control: a mode change as an event
   - Networks of networks: transitions among subnets as events
   - Lebesgue sampling
Page 35

Sensitivity-Based View of Optimization

1. A map of the learning and optimization world: different approaches can be obtained from two sensitivity equations.
2. Extension to event-based optimization: policy iteration, perturbation analysis, reinforcement learning, time aggregation, stochastic approximation, Lebesgue sampling, …
3. A simpler and complete derivation for MDPs: multi-chains, different performance criteria, average performance with no discounting, N-bias optimality up to Blackwell optimality.
Page 36

Pictures to Remember (I)

[Figure: the random walk with events a (probability 1/2, α large) and b (probability 1/2, α small)]
Page 37

Pictures to Remember (II)

[Figure: the map of the learning and optimization world, built on dη/dδ = π Q g and η′ − η = π′ Q g, with new sensitivity equations constructed for event-based optimization]
Page 38

Limitation of State-Based Formulation (I)

[Figure: a joke version of the random walk with three states: 0 Yantai, 1 Alaska, 2 Hawaii]
Page 39
Thank You!
Page 40

Xi-Ren Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach. Springer, October 2007. 9 chapters, 566 pages, 119 figures, 27 tables, 212 homework problems.