Optimal Tuning of Continual Online Exploration in
Reinforcement Learning
Youssef Achbany, Francois Fouss, Luh Yen, Alain Pirotte & Marco Saerens
Information Systems Research Unit (ISYS), Université de Louvain
Belgium
Outline
Introduction
Mathematical concepts
Modelling exploration by entropy
Optimal policy
Preliminary experiments
Conclusion and further work
Introduction

One of the challenges of reinforcement learning is to manage the trade-off between exploration and exploitation.
Exploitation aims to capitalize on already well-established solutions.
Exploration aims to continually try new ways of solving the problem; it is relevant when the environment is changing.
Introduction: a simple routing problem

The goal is to reach a destination node (13) from an initial node (1), while minimizing costs.
For each node, there is a set of admissible actions, each with an associated weight (cost), and we define a probability distribution on this set of admissible actions.
[Figure: routing network of 14 nodes; edges carry costs of 1 or 2; node 1 is the initial node and node 13 the destination.]
Mathematical concepts

We have a set of states, S = {1, 2, …, n}; s_t = k means that the system is in state k at time t.
In each state s = k, we have a set of admissible control actions, U(k), so that u(k) ∈ U(k) is a control action available at state k.
Mathematical concepts

When we choose action u(s_t) at state s_t, a bounded cost C(u(s_t) | s_t) < ∞ is incurred and the system jumps to state s_{t+1} = f(u(s_t) | s_t), where f is a function.
We suppose the network of states does not contain any negative cycle.
Mathematical concepts

For each state s, we define a probability distribution on the set of admissible actions, P(u(s) | s), meaning that the choice of action is randomized.
This introduces exploration, not only exploitation; it is the main contribution of our work.
Mathematical concepts

For instance, if in state s = k there are three admissible actions u_k^1, u_k^2, u_k^3, the probability distribution P(u(k) | s = k) involves three values: P(u_k^1 | k), P(u_k^2 | k) and P(u_k^3 | k).

[Figure: state k with three outgoing actions u_k^1, u_k^2, u_k^3, each labelled with its probability.]
Mathematical concepts

The policy, π, is defined as the set of all probability distributions, one for each state.
[Figure: the same 14-node routing network, now with a probability distribution attached to each node's admissible actions.]
Mathematical concepts

The goal is to reach a destination state, s = d, from an initial state, s_0 = k_0, while minimizing the total expected cost (see below).
The expectation is taken over the policy, that is, over all the random variables u(k) associated with the states.
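The cost criterion itself did not survive the slide extraction; a plausible reconstruction, consistent with the quantity V(k_0) minimized on the next slide, is

$$V_\pi(k_0) = \mathrm{E}_\pi\!\left[\sum_{t=0}^{T} C(u(s_t)\,|\,s_t) \;\middle|\; s_0 = k_0\right],$$

where T is the (random) time at which the destination d is reached.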
Mathematical concepts

In other words, we have to determine the best policy, that is, the best probability distributions, minimizing V(k_0).
This is standard, except for the fact that we introduce choice randomisation.
Mathematical concepts

We now introduce a way to control exploration: the degree of exploration, E_k, defined on each state k, which is the entropy of the probability distribution of actions in this state.
Modelling exploration by entropy

The degree of exploration, E_k, is defined as the entropy at state k (see the formula below).
The minimum is 0 (no exploration); the maximum is log(n_k), where n_k is the number of admissible actions in state k (full exploration).
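The defining formula is missing from the extraction; given that its maximum is log(n_k), it is presumably the Shannon entropy of the action distribution:

$$E_k = -\sum_{i \in U(k)} P(i|k)\,\log P(i|k)$$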
Modelling exploration by entropy

The exploration rate is defined by normalizing the entropy (see below), and takes its values between 0 (no exploration) and 1 (full exploration).
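A plausible reconstruction of the missing definition, since the rate must range over [0, 1], is the entropy divided by its maximum value (the notation E_k^r is ours):

$$E_k^{r} = \frac{E_k}{\log(n_k)}$$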
Modelling exploration by entropy

The goal now is to determine the optimal policy under exploration constraints.
That is, seek, among all policies, the policy π* for which the expected cost V(k_0) is minimal, while guaranteeing a given degree of exploration (entropy) in each state k.
Modelling exploration by entropy

In other words, we solve the constrained problem sketched below, where the E_k are provided/fixed by the user/designer.
They control the degree of exploration at each node k.
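The optimization problem itself is missing; a plausible reconstruction from the two preceding slides, assuming the entropy constraints are imposed as equalities, is

$$\pi^{*} = \operatorname*{arg\,min}_{\pi}\, V_\pi(k_0) \quad \text{subject to} \quad -\sum_{i \in U(k)} P(i|k)\,\log P(i|k) = E_k \ \text{ for each state } k.$$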
Modelling exploration by entropy

Thus, we route the agents as fast as possible, while exploring the network.
[Figure: the 14-node routing network, illustrating routing under exploration constraints.]
Optimal policy

Here are the necessary optimality conditions (for a local minimum), very similar to Bellman's equations (see below).
V*(k) is the optimal expected cost from state k.
P(i|k) is the probability of choosing action i in state k, satisfying the entropy constraint at k.
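The conditions did not survive extraction; a reconstruction, hedged against the companion technical report cited at the end, couples a Boltzmann action distribution with the expected-cost recursion:

$$P(i|k) = \frac{\exp\!\left[-\theta_k \left(C(i|k) + V^{*}(f(i|k))\right)\right]}{\sum_{j \in U(k)} \exp\!\left[-\theta_k \left(C(j|k) + V^{*}(f(j|k))\right)\right]}$$

$$V^{*}(k) = \sum_{i \in U(k)} P(i|k)\left[C(i|k) + V^{*}(f(i|k))\right], \qquad V^{*}(d) = 0,$$

where each θ_k is chosen so that the entropy of P(·|k) equals E_k.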
Optimal policy

These conditions lead to the following updating rules (a sketch follows).
Convergence has been proved in a stationary environment.
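The updating rules themselves are missing from the extraction. Below is a minimal Python sketch of one plausible implementation, assuming the Boltzmann form reconstructed above; all function and variable names (value_update, solve_theta, succ, cost) are ours, not the authors'.

```python
import numpy as np

def boltzmann(costs, theta):
    """Boltzmann distribution over actions: lower expected total cost -> higher probability."""
    z = -theta * (costs - costs.min())      # shift by the minimum for numerical stability
    w = np.exp(z)
    return w / w.sum()

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def solve_theta(costs, target_entropy, theta_max=1e4, iters=60):
    """Bisection on theta: the entropy of the Boltzmann policy decreases
    monotonically from log(n) at theta = 0 towards 0 as theta grows."""
    lo, hi = 0.0, theta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(boltzmann(costs, mid)) > target_entropy:
            lo = mid                        # still too much exploration: raise theta
        else:
            hi = mid
    return 0.5 * (lo + hi)

def value_update(succ, cost, dest, explo_rate, n_sweeps=200):
    """succ[k][i]: successor state f(i|k) of action i at state k;
    cost[k][i]: the matching cost C(i|k); dest: destination state;
    explo_rate: exploration rate in [0, 1], applied uniformly to every state."""
    n = len(succ)
    V = np.zeros(n)                         # V[dest] stays 0 by convention
    policy = [None] * n
    for _ in range(n_sweeps):
        for k in range(n):
            if k == dest:
                continue
            total = np.array([cost[k][i] + V[s] for i, s in enumerate(succ[k])])
            E_k = explo_rate * np.log(len(succ[k]))   # target entropy at state k
            p = boltzmann(total, solve_theta(total, E_k))
            policy[k] = p
            V[k] = float(p @ total)         # expected cost-to-go under p
    return V, policy

# Hypothetical toy graph: 0 -> 1, 1 -> {0, 2}, 2 -> {1, 3}, destination 3.
succ = [[1], [0, 2], [1, 3], []]
cost = [[1.0], [1.0, 1.0], [1.0, 2.0], []]
V, policy = value_update(succ, cost, dest=3, explo_rate=0.3)
```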
Optimal policy

This updating rule has a nice interpretation: route the agents preferably (with probability P(i|k)) to the state from which the expected cost is minimal, including the direct cost of reaching that state.
[Figure: the 14-node routing network, illustrating the preferential routing interpretation.]
Optimal policy

If θ_k is large (zero entropy: no exploration), we obtain the common value iteration algorithm, or Bellman's equation, for finding the shortest path (see below).
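The limiting equation, dropped by the extraction, is the standard Bellman recursion:

$$V^{*}(k) = \min_{i \in U(k)}\left[C(i|k) + V^{*}(f(i|k))\right]$$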
Optimal policy

If θ_k is zero (maximum entropy: full exploration), we perform a blind exploration: we estimate the "average first passage time" without taking the costs into consideration (see below), where n_k is the number of admissible actions in state k.
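A plausible reconstruction of the missing formula: the action distribution becomes uniform, P(i|k) = 1/n_k, so the recursion simply averages over all admissible actions,

$$V^{*}(k) = \frac{1}{n_k}\sum_{i \in U(k)}\left[C(i|k) + V^{*}(f(i|k))\right];$$

with unit costs, this quantity is the average first-passage time to the destination.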
Advantages of our algorithm

Our strategy could be interesting if the environment is changing and there is a need for continual exploration.
Indeed, if no exploration is performed, the agent will not notice changes unless they occur on the shortest path, so the policy will not be adjusted.
In other words, we propose an optimal exploration/exploitation trade-off.
Preliminary experiments

Simple network routing, in a dynamic and uncertain environment.
[Figure: the 14-node routing network used in the experiments.]
Preliminary experiments

Exploration rate of 0% for all nodes (no exploration).
[Figure: traffic through the network at 0% exploration. Legend: white = very low traffic, light gray = low, gray = medium, dark gray = high, black = very high traffic.]
Preliminary experiments

Exploration rate of 30% for all nodes.

[Figure: traffic through the network at 30% exploration; same gray-level legend as above.]
Preliminary experiments

Exploration rate of 60% for all nodes.

[Figure: traffic through the network at 60% exploration; same gray-level legend as above.]
Preliminary experiments

Exploration rate of 90% for all nodes.

[Figure: traffic through the network at 90% exploration; same gray-level legend as above.]
Preliminary experiments

Other experimental simulations are provided in "Tuning continual exploration in reinforcement learning" (technical report, submitted for publication):
http://www.isys.ucl.ac.be/staff/francois/Articles/Achbany2005a.pdf
Conclusion

In this work, we presented a model integrating both exploration and exploitation in a common framework.
The exploration rate is controlled by the entropy of the choice probability distribution defined on the states of the system.
When no exploration is performed (zero entropy at each node), the model reduces to the common value iteration algorithm computing the minimum-cost policy.
On the other hand, when full exploration is performed (maximum entropy at each node), the model reduces to a "blind" exploration, without considering the costs.
Further work

This model has been extended to stochastic shortest-path problems, discounted problems, acyclic graphs, and edit-distances between strings.
We are also developing links with Q-learning.
Thank you !!!