Testing Stochastic Processes Through Reinforcement Learning
François Laviolette, Sami Zhioua, Josée Desharnais
NIPS Workshop, December 9th, 2006
Outline
- Program Verification Problem
- The Approach for Trace Equivalence
- Other Equivalences
- Application on MDPs
- Conclusion
Stochastic Program Verification
- Specification (LMP): an MDP without rewards.
- Implementation.
[Figure: an example LMP with states s0..s6 and probabilistic transitions such as a[0.5], a[0.3], b[0.9], c]
How far is the Implementation from the Specification? (Distance or divergence)
- The Specification model is available.
- The Implementation is available only for interaction (no model).
1. Non-deterministic trace equivalence
Two systems are trace equivalent iff they accept the same set of traces.
[Figure: two labelled transition systems P and Q]
T(P) = {a, aa, aac, ac, b, ba, bab, c, cb, cc}
T(Q) = {a, ab, ac, abc, abca, ba, bab, c, ca}
2. Probabilistic trace equivalence
Two systems are trace equivalent iff they accept the same set of traces, with the same probabilities.
[Figure: two probabilistic systems P and Q with transitions such as a[2/3], a[1/3], a[1/4], a[3/4], b[1/2], c[1/2]]
Trace probabilities for P: a 7/12, aa 5/12, aac 1/6, bc 2/3, …
Trace probabilities for Q: a 1, aa 1/2, aac 0, bc 0, …
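A small sketch of how such trace probabilities can be computed when the model is available (Python; the dictionary encoding and the example system are mine, not from the slides):

```python
def trace_prob(lmp, state, trace):
    """Probability of running `trace` to completion from `state`.
    An LMP is a dict state -> action -> list of (probability, next_state);
    per-action probabilities may sum to less than 1 (the rest is a refusal)."""
    if not trace:
        return 1.0
    action, rest = trace[0], trace[1:]
    return sum(p * trace_prob(lmp, nxt, rest)
               for p, nxt in lmp.get(state, {}).get(action, []))

# Hypothetical example system (not the one in the figure):
P = {
    "s0": {"a": [(2/3, "s1"), (1/3, "s2")]},
    "s1": {"a": [(1/4, "s3")]},
    "s2": {"a": [(3/4, "s3")]},
}
print(trace_prob(P, "s0", ["a", "a"]))   # 2/3 * 1/4 + 1/3 * 3/4 = 5/12
```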
Testing (Trace Equivalence)
The system is a black box with one button per action (a, b, …, z). When a button is pushed (action execution), either:
- the button goes down (a transition fires), or
- the button does not go down (no transition).
Grammar (trace equivalence): t ::= ω | a.t
Observations: when a test t is executed, several observations are possible: O_t.
[Figure: fragment of an LMP with transitions a[0.2], a[0.5], b[0.7]]
Example: for t = a.b, O_t = {a✗, a.b✗, a.b✓}, with probabilities 0.3, 0.56, and 0.14.
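A sketch of this button-pushing protocol (Python; the class and function names are illustrative, not the authors'):

```python
import random

class BlackBox:
    """A process we can only interact with: one 'button' per action."""
    def __init__(self, lmp, start):
        self.lmp, self.state = lmp, start

    def try_action(self, action):
        # True if the button goes down (a transition fires).
        r, acc = random.random(), 0.0
        for p, nxt in self.lmp.get(self.state, {}).get(action, []):
            acc += p
            if r < acc:
                self.state = nxt
                return True
        return False   # button blocked: no transition

def run_test(box, trace):
    """One execution of a trace test; the observation is the prefix reached."""
    prefix = []
    for a in trace:
        if not box.try_action(a):
            return prefix, False   # e.g. (['a'], False)  ~  a.b✗
        prefix.append(a)
    return prefix, True            # e.g. (['a','b'], True)  ~  a.b✓
```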
Outline
- Program Verification Problem
- The Approach for Trace Equivalence
- Other Equivalences
- Application on MDPs
- Conclusion
Why Reinforcement Learning?
[Figure: an LMP with states s0..s8 and probabilistic transitions, and its induced MDP]
- Reinforcement Learning is particularly efficient in the absence of the full model.
- Reinforcement Learning can deal with bigger systems.
Analogy:
  LMP        ↔  MDP
  Trace      ↔  Policy
  Divergence ↔  Optimal value (V*)
A Stochastic Game towards RL
[Figure: sequences of Success/Failure observations obtained by testing the Implementation, the Specification, and a clone of the Specification, with rewards +1 and -1]
[Figure: three LMPs: the Implementation, the Specification, and the Specification (clone)]
- Reward (+1) when the Implementation's observation differs from the Specification's.
- Reward (-1) when the Specification's observation differs from the Clone's.
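A minimal sketch of one round of this game as I read the slide (Python, reusing the BlackBox sketch above; not the authors' code):

```python
def game_step(impl, spec, clone, action):
    # Push the same button on all three processes (each a BlackBox).
    o_impl = impl.try_action(action)
    o_spec = spec.try_action(action)
    o_clone = clone.try_action(action)
    reward = 0
    if o_impl != o_spec:
        reward += 1    # Implementation disagrees with Specification
    if o_spec != o_clone:
        reward -= 1    # Specification disagrees with its own clone
    return reward
```

If the Implementation behaves exactly like the Specification, both disagreement events have the same probability, so the expected reward is 0; any behavioural difference tips the balance toward positive rewards.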
MDP Definition
MDP: its States, Actions, and Next-state probability distributions come from the Specification LMP.
[Figure: the Implementation and Specification LMPs, and the induced MDP with states s0..s10 plus an absorbing Dead state reached when an action fails]
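A sketch of this construction as I read the figure (Python; the sub-stochastic LMP encoding is the one used above, and "Dead" absorbs the probability mass of refused actions):

```python
DEAD = "Dead"

def induced_mdp(spec):
    """States, actions and transitions come from the Specification LMP."""
    mdp = {DEAD: {}}   # absorbing state: no actions available
    for s, actions in spec.items():
        mdp[s] = {}
        for a, outcomes in actions.items():
            transitions = list(outcomes)
            missing = 1.0 - sum(p for p, _ in outcomes)
            if missing > 0:
                transitions.append((missing, DEAD))  # action refused
            mdp[s][a] = transitions
    return mdp
```

Note that the rewards are not part of this model: they come from interacting with the Implementation during the game, which is why RL applies.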
Divergence Computation
[Figure: Success/Failure observation sequences with rewards +1, 0, and -1]
The divergence is V*(s0): 0 = Equivalent, 1 = Different.
[Figure: the Implementation and Specification LMPs, and the induced MDP with a Dead state]
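If both models were available, V*(s0) could be computed directly by value iteration; in the intended setting the Implementation is a black box, so V* is instead estimated by RL (next slides). A generic sketch (Python; `reward` is a hypothetical (state, action) function, not part of the slides):

```python
def value_iteration(mdp, reward, gamma=0.8, iters=1000):
    """Compute V* for an MDP given as state -> action -> [(p, next_state)]."""
    V = {s: 0.0 for s in mdp}
    for _ in range(iters):
        V = {s: max((reward(s, a) + gamma * sum(p * V[t] for p, t in outs)
                     for a, outs in acts.items()), default=0.0)
             for s, acts in mdp.items()}
    return V   # the divergence estimate is V["s0"]
```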
Symmetry Problem
[Figure: Implementation vs Specification observation sequences (F/S), with rewards +1 and -1]
Fix: create two variants of each action a:
- a success variant (a✓)
- a failure variant (a✗)
[Figure: Implementation s0 -a[1]-> s1; Specification s0 -a[0.5]-> s1; Specification (clone) s0 -a[0.5]-> s1]
At each step: select an action and make a prediction (✓ or ✗), then execute the action.
- If prediction = observation: compute and give the reward.
- If prediction ≠ observation: give reward 0.
Prob = 0·0.5·0.5 + 1·0.5·0.5 = 0.25 on each side.
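A sketch of the prediction-gated reward as I read this slide (Python, reusing the BlackBox sketch; my reading, not the authors' code):

```python
def game_step_with_prediction(impl, spec, clone, action, predict_success):
    # The agent picks a success or failure variant of the action, i.e. it
    # commits to a predicted outcome before pushing the button.
    o_impl = impl.try_action(action)
    o_spec = spec.try_action(action)
    o_clone = clone.try_action(action)
    reward = 0
    if o_impl == predict_success and o_impl != o_spec:
        reward += 1   # Impl matched the prediction, Spec did not
    if o_clone == predict_success and o_clone != o_spec:
        reward -= 1   # Clone matched the prediction, Spec did not
    return reward
```

With the a[0.5] Specification and clone of the figure, the (-1) side fires with probability 0.5 × 0.5 = 0.25, matching the computation on the slide.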
The Divergence (with the symmetry problem fixed)
Theorem. Let "Spec" and "Impl" be two LMPs, and M their induced MDP. Then:
- V*(s0) ≥ 0, and
- V*(s0) = 0 iff "Spec" and "Impl" are trace-equivalent.
Implementation and PAC Guarantee
- There exists a PAC guarantee for the Q-Learning algorithm, but Fiechter's algorithm has a simpler PAC guarantee.
- Besides, it is possible to obtain a lower bound thanks to the Hoeffding inequality.
Implementation:
- γ = 0.8
- Action selection: softmax (temperature decreasing from 0.8 to 0.01)
- RL algorithm: Q-Learning, with learning rate decreasing according to the function 1/x
- PAC guarantee: …
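A minimal Q-Learning loop with the slide's settings, discount 0.8, softmax selection with temperature decaying from 0.8 to 0.01, and a 1/x learning rate (Python; `env` is a hypothetical environment with reset() -> state and step(action) -> (next_state, reward, done), not part of the slides):

```python
import math, random
from collections import defaultdict

def softmax_pick(Q, state, actions, tau):
    # Boltzmann (softmax) exploration; subtract the max for numerical stability.
    prefs = [Q[(state, a)] / tau for a in actions]
    m = max(prefs)
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights)[0]

def q_learning(env, actions, episodes=10000, gamma=0.8):
    Q = defaultdict(float)
    visits = defaultdict(int)
    for ep in range(episodes):
        tau = max(0.01, 0.8 * (1 - ep / episodes))   # temperature 0.8 -> 0.01
        s, done = env.reset(), False
        while not done:
            a = softmax_pick(Q, s, actions, tau)
            s2, r, done = env.step(a)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]             # learning rate ~ 1/x
            target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q   # the divergence estimate is max over a of Q[(s0, a)]
```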
Outline
- Program Verification Problem
- The Approach for Trace Equivalence
- Other Equivalences
- Application on MDPs
- Conclusion
Testing (Bisimulation)
The system is a black box.
Grammar (bisimulation): t ::= ω | a.t | (t1, … , tn)
The new construct (t1, … , tn) is replication: the current state is copied and each copy is tested independently.
[Figure: fragment of an LMP with transitions a[0.2], a[0.5], b[0.7]]
Example: for t = a.(b,b), P_{t,s0} gives
O_t = {a✗, a.(b✗,b✗), a.(b✗,b✓), a.(b✓,b✗), a.(b✓,b✓)} with probabilities 0.3, 0.518, 0.042, 0.042, 0.098.
New Equivalence Notion: "By-Level Equivalence"
[Figure: two systems P and Q illustrating by-level equivalence, with transitions such as a[1/3], a[2/3], b[1/3], c[2/3]]
K-Moment Equivalence
- 1-moment (trace): t ::= ω | a.t
- 2-moment: t ::= ω | a^k.t, k ≤ 2
- 3-moment: t ::= ω | a^k.t, k ≤ 3
X is a random variable such that Pr(X = p_i) is the probability of performing the trace and making a transition to a state that accepts action a with probability p_i.
Recall: the kth moment of X is E(X^k) = Σ_i x_i^k · Pr(X = x_i).
Two systems are "by-level" equivalent iff they are k-moment equivalent for every k.
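To see the extra distinguishing power of higher moments, here is a small worked example (mine, not from the slides): two transition-probability distributions with the same mean but different second moments.

```latex
% X takes values 1/3 and 2/3 with probability 1/2 each; Y equals 1/2 surely.
\[
E(X) = \tfrac12\cdot\tfrac13 + \tfrac12\cdot\tfrac23 = \tfrac12 = E(Y),
\qquad
E(X^2) = \tfrac12\cdot\tfrac19 + \tfrac12\cdot\tfrac49 = \tfrac{5}{18}
\;\neq\; \tfrac14 = E(Y^2).
\]
```

So 1-moment (trace) tests cannot separate the two systems, but 2-moment tests can.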
Ready Equivalence and Failure Equivalence
1. Ready Equivalence
Two systems are Ready equivalent iff for any trace tr and any set of actions A, they have the same probability of running tr successfully and reaching a process accepting all actions in A.
Test grammar: t ::= ω | a.t | {a1, …, an}
[Figure: systems P and Q; the pair (<a>, {b,c}) has probability 2/3 in P and 1/2 in Q]
2. Failure Equivalence
Two systems are Failure equivalent iff for any trace tr and any set of actions A, they have the same probability of running tr successfully and reaching a process refusing all actions in A.
Test grammar: t ::= ω | a.t | {a1, …, an}
[Figure: the same systems P and Q; the pair (<a>, {b,c}) has probability 1/3 in P and 1/2 in Q]
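A sketch of the quantity both definitions measure (Python, reusing the LMP encoding from the first sketch; the names are mine):

```python
def accepts(lmp, state, action):
    # A state accepts an action if some transition on it has positive probability.
    return any(p > 0 for p, _ in lmp.get(state, {}).get(action, []))

def ready_prob(lmp, state, tr, A, refuse=False):
    """Probability of running trace tr from `state` and reaching a state that
    accepts (refuse=False) or refuses (refuse=True) every action in A."""
    if not tr:
        ok = all(accepts(lmp, state, a) != refuse for a in A)
        return 1.0 if ok else 0.0
    a, rest = tr[0], tr[1:]
    return sum(p * ready_prob(lmp, nxt, rest, A, refuse)
               for p, nxt in lmp.get(state, {}).get(a, []))
```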
Barb Equivalence
1. Barb acceptance
[Figure: systems P and Q; the barb (<a,b>, <{a,b},{b,c}>) has probability 2/3]
2. Barb refusal
[Figure: the same systems P and Q; the barb (<a,b>, <{b,c},{b,c}>) has probability 1/3]
Test grammar: t ::= ω | a.t | {a1, …, an}.a.t
Outline
- Program Verification Problem
- The Approach for Trace Equivalence
- Other Equivalences
- Application on MDPs
- Conclusion
Application on MDPs
[Figure: two MDPs (MDP 1 and MDP 2) with states s0..s9, transition probabilities, and rewards r1..r8 on the transitions]
- Case 1: the reward space contains 2 values (binary): 0 and 1.
- Case 2: the reward space is small (discrete): {r1, r2, r3, r4, r5}.
- Case 3: the reward space is very large (continuous): w.l.o.g. [0,1].
Application on MDPs
Case 1: the reward space is binary. Identify r1 = 0 with Failure (F) and r2 = 1 with Success (S).
Case 2: the reward space is small (discrete). Split each action by reward value: a becomes a·r1, …, a·r5 and b becomes b·r1, …, b·r5, with a matching reward observed as S and any other as F.
Case 3: the reward space is continuous. Intuition: r = 3/4 behaves like 1 with probability 3/4 and 0 with probability 1/4. After executing action a with reward r, pick a reward value ranVal at random: if ranVal < r, observe S; if ranVal ≥ r, observe F.
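A sketch of the Case 3 reduction (Python; the uniform draw is my reading of "pick a reward value randomly"):

```python
import random

def binarize_reward(r):
    """Turn a continuous reward r in [0,1] into a Success/Failure observation.
    Success occurs with probability exactly r (r = 3/4 gives S w.p. 3/4)."""
    return "S" if random.random() < r else "F"
```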
Current and Future Work
- Application to different equivalence notions: Failure equivalence, Ready equivalence, Barb equivalence, etc.
- Experimental analysis on realistic systems.
- Applying the approach to compute the divergence between HMMs, POMDPs, and probabilistic automata.
- Studying the properties of the divergence.