
SECOND WORLD CONFERENCE ON POM AND 15TH POM CONFERENCE, CANCUN, MEXICO, APRIL 30 - MAY 3, 2004.

Global Supply Chain Management based on Collective Intelligence

Luis Rocha-Mier¹, Leonid Sheremetov, Miguel Contreras, César Osuna, Manuel Romero, Luis Villa, Ana Hernández

Mexican Petroleum Institute (IMP)

Program of Research in Applied

Mathematics and Computing

Eje Central Lázaro Cárdenas 152

Col. San Bartolo Atepehuacan, 07730

Mexico City, Mexico

Tel. + (52) 55 9175-7274, Fax + (52) 55 9175-6277

email: {lrocha,sher,mcontreras,cosuna,mromeros,lvilla,[email protected]}

Abstract number: 002-0379

Abstract

An approach to the problem of optimization of local decisions to assure global optimization in supply chain

performance is developed within the framework of a NEural COllective INtelligence (NECOIN). The proposed

framework focuses on the interactions at the local and the global levels among the agents in order to improve

the overall supply chain business process behavior. A COIN is defined as a large Multi-Agent System (MAS) with no

centralized control and communication, but where there is a global task to complete. In addition, learning consists of

adapting the local behavior of each entity with the aim of optimizing a given global behavior. Reinforcement learning

algorithms are used at the local level, while a generalization of the Q-neural algorithm is used to optimize the global

behavior. The proposed framework was implemented using Netlogo (an agent-based parallel modeling and simulation

environment) to test the proposed optimization algorithm. The work demonstrates that Supply Chain Management

(SCM) is a good experimental field for the application of the NECOIN theory.

Index Terms

Supply Chain Management (SCM), Multi-Agent System (MAS) Learning, Reinforcement Learning (RL), CMAC

Neural Networks (NN), Neural Collective Intelligence (NECOIN).

¹To whom correspondence should be addressed.


I. INTRODUCTION

In order to understand the significance of changes taking place in enterprise integration initiatives, like Supply

Chains (SC), it would be prudent to review trends in production and operations management [Eloranta et al.99].

First, strong worldwide competition and highly specific customer requirements regarding product quality, delivery time, and services force industry to optimize production continuously. Second, the change towards demand-driven production implies that not the management of supplies but the demands of customers should

trigger and influence the production processes. As a consequence, logistics gets a new focus on optimization of the

production process in a very dynamic environment. Finally, though there are many solutions and techniques for local

optimization (e.g. planning and scheduling systems, inventory management systems, market trading optimization

systems, etc.), usually these decisions do not assure the overall business optimization at the global level because of

the conflicts between the local goals [Julka et al.02b].

The Supply Chain Management (SCM) problem can be defined as the management of relationships across a

supply chain to capture the synergy of intra- and inter-company business processes with the aim of optimizing

the overall business process of the enterprise (e.g. on-time delivery, quality assurance, and cost minimization)

[Lambert et al.00]. The simple integration of the traditional techniques is not enough to assure global optimization

due to the inherent complexity of the problem. For example, [Dreher99] has developed a complexity index for

VOLKSWAGEN showing that an automobile is manufactured from between 3,000 and more than 20,000 parts. As

shown in [JE et al.00], an integrated distributed production planning system for this supply chain (SC) coordinated

and controlled centrally (as depicted in Figure 1) would cause a number of problems:

• Centers of control are bottlenecks.

• Centers must have complete knowledge for decision-making.

• Confidential internal information must be provided to the center.

• Difficult reorganization of the chains.

• Planning in the complete supply chain can be very complex.

• Data consistency is not guaranteed in decentralized structures.

That is why, in this work, the SCM problem is addressed within the context of the NEural COllective INtelligence (NECOIN) theory [Wolpert et al.99], [RM02], [Ferber97] and an adaptation of the Q-neural algorithm [RM02].

According to our conceptualization of the SCM problem within the NECOIN, a SC is a large Multi-Agent System

(MAS) where:

• One of the main objectives is the decentralization of the control and communication.

• Each node of the SC is represented as an agent with autonomous behavior and a local utility function.

• The learning process consists of adapting the “local behavior” of each SC node (agent) with the aim of

optimizing a given “global behavior” of the SC.

• The agents execute Reinforcement Learning (RL) algorithms at the local level and use Neural Networks to

support the learning process.


Fig. 1. Real-time communication infrastructure (Adapted from [JE et al.00])

This conceptualization permits handling the SCM problem within the following context:

• The environment is of a distributed and dynamic nature.

• The model of the behavior of the environment at the beginning is unknown.

• The individual behavior of each entity affects the total behavior of the system. One of the problems is to find out which part of the system affects the total behavior. Another problem is to know the degree of responsibility of each entity, also known as the 'Credit Assignment Problem.'

• The number of states and variables is very large.

• The entities must adapt their decisions online.

In this paper, we study how each agent contributes (or not) to the overall business performance. More precisely,

how it can be determined whether an agent's local decisions contribute to optimizing the overall business performance.

The paper is organized as follows: Section 2 presents the Supply Chain Management problem. Section 3 introduces Dynamic Programming theory. Section 4 describes the Reinforcement Learning (RL) theory. Section 5

presents the NEural COllective INtelligence theory. Section 6 describes the NECOIN theory as a framework to

address the Supply Chain Management in a proposed scenario. A simplification of the Q-neural algorithm adapted


to the scenario of the job routing for the production supply chain is described in Section 7. Section 8 presents

a case study description and the simulation results. Section 9 discusses the advantages and disadvantages of our

approach, concludes the paper and proposes further work. Finally, the appendix describes the notation used.

II. SUPPLY CHAIN MANAGEMENT PROBLEM

A supply chain is defined as a network of suppliers, factories, warehouses, distribution centers, and retailers

through which raw materials are acquired, transformed, and delivered to customers. The performance of any entity

in a supply chain depends on the performance of others, and their ability to coordinate their activities.

The SCM problem is complex due to the following features that must be considered to provide a viable solution

[Shen et al.99]:

a. Enterprise Integration: The different departments and processes (e.g. purchasing, orders, design, production,

planning and scheduling, control, transport, resources, personnel, materials, quality, etc.) must be integrated

in order to support global competitiveness and rapid market responsiveness.

b. Distributed: The different entities in the chain that might be distributed across different geographical locations

must be considered as a whole by using a distributed knowledge-based system that permits linking demand management directly to resource and capacity planning and scheduling. For example, final product storage facilities that might be located in different cities need to know the global stock inventory.

c. Disparate: The departments in the supply chain might use different software and hardware built on different

platforms. The information of every agent may be disparate. For example, product shipment tracking might be done through emails, faxes, telephone calls, etc.

d. Dynamic: The supply chain is constantly changing: there is no obligation to remain part of the supply chain for a certain period, and elements may join or leave the chain based on their own interests. Moreover, the information is

changing continuously (e.g. price, demands, etc.). In addition, there is pressure to accommodate the integration

of new software, hardware or manufacturing devices.

e. Cooperation: The different elements in the supply chain must cooperate in order to achieve the global enterprise

objective (e.g. overall business performance).

f. Integration of humans with software and hardware: The integration of people and computers is necessary in

order to permit rapid access and communication of the required knowledge and information.

g. Online adaptation: The SCM optimization process must be oriented to reducing the product cycle time to be

able to respond to customer demands in real time. Also, each element in the chain must be able to adapt to

the changes in the environment on the fly.

h. Scalability: The supply chain system must be able to incorporate resources into the chain as required without

decreasing the overall business performance.

i. Fault Tolerance: The system must be able to recover from resource failures at any level and to minimize

their impacts on the overall business performance.


In this paper, the SCM problem is formulated as optimization of the behavior of the agents based on the

implementation of the Q-neural algorithm. This algorithm combines Reinforcement Learning based algorithms at the local level with the NEural COllective INtelligence theory in order to optimize the overall business performance of the SC. RL is based on the more general Dynamic Programming theory, in particular on the Bellman

equation, briefly introduced in the following section.

III. DYNAMIC PROGRAMMING

In this section the Dynamic Programming theory is explained. This theory is the basis of the RL methods¹ and

the Shortest Path Algorithms (see [Sakarovitch84], [Bellman58] and [Ford et al.62]).

“Dynamic Programming (DP)”² is a technique which addresses problems that arise in situations where decisions are taken in steps, and the result of each decision is partially foreseeable before the remaining decisions are

taken. An important aspect to consider in DP is the fact that the decisions cannot be made separately [Sakarovitch84].

For example, the desire to obtain a reduced cost in the present must be balanced against the desire to induce low

costs in the future. This constitutes a “Credit Assignment Problem” since one must assign credit or blame to each decision. For “optimal planning”, it is necessary to have an effective compromise between the immediate

costs and the future costs.

More precisely, Dynamic Programming focuses on the question: How can a system learn to sacrifice its short-term

performance to improve its long-term performance? To answer this, it relies on the application of Bellman's principle of optimality.

If a certain succession of decisions from the initial moment t = 0 to the final moment t = K is called a strategy and a succession of consecutive policies belonging to a strategy is called a sub-strategy, the Bellman optimality principle

[Bellman57] can be defined as follows:

An optimal strategy $\pi^*$ is such that, whatever the initial state x(0) = i and the initial decision $a_{i,k} \in A_i$, the

remaining decisions must constitute an optimal sub-strategy, with regard to the state resulting from the first

decision.

In other words:

An optimal strategy $\pi^*$ can be constituted only of optimal sub-strategies $\{\mu^*_{x(0)}(0), \mu^*_{x(1)}(1), \ldots, \mu^*_{x(K)}(K)\}$.

To formulate the Bellman principle in mathematical terms, the problem of finite horizon is considered, the value

function of which is defined by:

¹These methods are used within the framework of the NEural COllective INtelligence theory.
²For a detailed explanation of the subject see [Bertsekas et al.96].


$$V^{\pi}_{x(0)} = E^{\pi}\left\{ g_{x(K)}(K) + \sum_{t=0}^{K-1} g_{(x(t),\,\mu_{x(t)}(t),\,x(t+1))}(t) \right\} \qquad (1)$$

where K is the time horizon (number of steps), $g_{x(K)}(K)$ the terminal cost and $g_{(x(t),\mu_{x(t)}(t),x(t+1))}(t)$ the cost due to the state transition from x(t) to x(t+1). Consequently, once x(0) is fixed, the expectation in Equation 1 depends only on the remaining states x(1), ..., x(K − 1). The principle of optimality can be formalized as in

([Bertsekas et al.96]):

If $\pi^*$ is an optimal strategy such that:

$$\pi^* = \{\mu^*_{x(0)}(0), \mu^*_{x(1)}(1), \ldots, \mu^*_{x(K)}(K)\} \qquad (2)$$

and if, while using the optimal strategy $\pi^*$, a given state x(n) occurs with positive probability, the value function starting from this state can be rewritten as follows:

$$V^{\pi}_{x(n)} = E^{\pi}\left\{ g_{x(K)}(K) + \sum_{t=n}^{K-1} g_{(x(t),\,\mu_{x(t)}(t),\,x(t+1))}(t) \right\} \qquad (3)$$

The truncated strategy:

$$\{\mu^*_{x(n)}(n), \mu^*_{x(n+1)}(n+1), \ldots, \mu^*_{x(K-1)}(K-1)\} \qquad (4)$$

is also optimal for the sub-problem.

The principle of optimality is easy to justify: if the truncated strategy were not optimal as claimed, then, when the state x(n) is reached at moment n, the strategy shown in Equation 2 would not be optimal either, since the cost of going from state x(0) to state x(K) while passing through x(n) could be reduced. This result provides an algorithm for searching for the optimal path in an arbitrary graph.

It is on this principle, proved by contradiction, that Dynamic Programming rests. The global optimization of the objective function is replaced by a sequential optimization that consists of optimizing each decision stage (or period), one after the other, while taking into account the decisions made previously and the remaining decisions [Chevalier77].

The solution of a Markov Decision Process (MDP) (see [Puterman94]) consists of finding an optimal strategy $\pi^*$ that corresponds to the minimum of the value function $V^{\pi}_i$, for each initial state $i \in X$. This consists of finding the optimal value function $V^*_i$ such that:

$$V^*_i = \min_{\pi} V^{\pi}_i, \quad \text{for each state } i \in X \qquad (5)$$

In an MDP the system state at the next decision epoch is determined by:

$$p_{(i,a_{i,k},j)} = P(j \mid i, a_{i,k}) \qquad (6)$$


that represents the transition probability from state i to state j when the action $a_{i,k}$ is executed. Thus, the transition probability $P(j \mid i, a_{i,k})$ from state i to state j depends only on the state i and the action $a_{i,k}$. This is the “Markov property”, which is very important because it implies that the current state of the environment provides all the information necessary to decide which action to execute.

The Bellman equation presented below permits finding the optimal value function to solve the MDP:

$$V^*_i = \min_{\mu_i}\left( c_{(i,\mu_i)} + \sum_{j=0}^{N} p_{(i,a_{i,k},j)}\, V^*_j \;\middle|\; i = x(0),\; j = x(1) \right) \quad \text{for all } i \in X \qquad (7)$$

where:

$$c_{(i,\mu_i)} = \sum_{j=1}^{N} p_{(i,\mu_i,j)} \cdot g_{(i,\mu_i,j)}.$$

This equation must be considered as a system of N equations, with one equation per state. The solution of this

system of equations determines the optimal value function for N states of the environment. There are many methods

to calculate the optimal policy (e.g. policy iteration and value iteration).
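As a concrete illustration of how Equation 7 can be solved, the following is a minimal value-iteration sketch in Python; the array layout for the transition probabilities and costs, and the discount factor, are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def value_iteration(P, g, gamma=0.95, tol=1e-6):
    """Minimal value-iteration sketch for a finite MDP with cost minimization.

    P[a, i, j]: assumed transition probabilities p(i, a, j)
    g[a, i, j]: assumed transition costs g(i, a, j)
    Returns an approximation of the optimal value function V* and a greedy policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # c(i, a) = sum_j p(i, a, j) * g(i, a, j), plus the expected discounted future cost
        Q = np.einsum('aij,aij->ai', P, g) + gamma * np.einsum('aij,j->ai', P, V)
        V_new = Q.min(axis=0)            # one Bellman backup per state (cf. Equation 7)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=0)
        V = V_new
```

Policy iteration would instead alternate a full policy-evaluation step with a greedy policy-improvement step over the same quantities.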

IV. REINFORCEMENT LEARNING

Reinforcement Learning (RL) extends the ideas of Dynamic Programming to address the broader and more ambitious goals of Artificial Intelligence (AI). Rather than being based, like Dynamic Programming, solely on the resolution of optimal control problems, Reinforcement Learning is an aggregate of ideas from various research fields, for example psychology, statistics, cognitive science, and computer science.

Relation between Reinforcement Learning and Dynamic Programming

The methods of RL, like those of Temporal Differences (TD) [Sutton et al.98], are related to the research field of Dynamic Programming, which, as the preceding section explained, makes it possible to solve Markov Decision Processes (MDPs).

Processes (MDP). To compute the optimal strategy, DP assumes perfect knowledge of the environment model (e.g.

transition probabilities between the states of the environment and the costs (rewards/punishments) which the agent

receives from this environment). The first question addressed by DP was how to compute the optimal strategy with the minimum amount of computation, by supposing that the environment can be perfectly simulated,

without the need for direct interaction with it. The new trend in RL methods is to assume no knowledge of the environment model at the beginning and to predict rewards/punishments.

Moreover, instead of moving within an internal mental model, the agent must act in the real world and observe

the consequences of its actions. In this case, we are interested in the number of real world actions the agent must

carry out to move towards an optimal strategy, rather than in the number of algorithmic iterations necessary to

find this optimal strategy. One of the meeting points of these two fields (DP and RL) is the Bellman equation³

which makes it possible to define an optimal strategy [Mitchell97]. The systems which learn while interacting with

a real environment and by observing their results are called online systems. On the other hand, the systems which

³See Equation 7.


learn separately, without interacting with a real environment and with an internal model of this environment are

called off-line systems.

Reinforcement Learning answers the question: how to make the mapping between the states (of an environment

whose model is not known entirely) and the actions of an agent (which interacts with this environment online) so

as to maximize/minimize a numerical signal of reward/punishment? Learning by trial and error, and having a delayed reward/punishment that affects the future behavior of the agent, are the two most important characteristics that differentiate it from other types of learning methods. The three most important aspects in RL

are:

a. the perception,

b. the action,

c. the goal.

The agent must be able to sense the state of the environment and to carry out actions that modify this state in order to reach the design goal.

A dilemma which arises in the field of RL and not necessarily with other types of learning is the trade-off which

exists between the exploitation phase and the exploration phase. The agent must exploit the knowledge obtained

until now to select the actions which brought it a high reward. But, at the same time, it must explore all the possible

actions in its current state, in order to select an action which can bring it a higher reward than the actions carried

out in the past. Several mathematicians have studied the dilemma of exploration-exploitation [Sutton et al.98]. It

should be noted that this dilemma is not present in the supervised learning.

Q-learning

One of the most important advances in the field of RL was the development of Q-learning [Watkins89], an off-policy algorithm [Sutton et al.98]. In this case, the learned value-action function, Q, approximates the optimal value-action function $Q^*$ independently of the strategy followed. In state x(t), if the Q-values represent the environment model exactly, the best action will be the one with the lowest (or highest, according to the case) value among all possible actions $a_{x(t)} \in A_{x(t)}$. The Q-values are learned by using an update rule which uses a reward r(t + 1) calculated by the environment and a function of the Q-values of the states reachable by taking the action $a_{x(t)}$ in state x(t). The update rule of Q-learning is defined by:

$$Q_{(x(t),a_{x(t)})}(t+1) = Q_{(x(t),a_{x(t)})}(t) + \alpha\left[ r(t+1) + \gamma \min_{a_{x(t+1)}} Q_{(x(t+1),a_{x(t+1)})}(t+1) - Q_{(x(t),a_{x(t)})}(t) \right] \qquad (8)$$

Algorithm 1 shows the Q-learning procedure.

The RL methods presented make it possible to solve learning problems for a single agent. Nevertheless, when several agents act in a common environment, these methods are not very effective. There is a lack of learning mechanisms for systems comprising a large number of agents. In our research work, these problems are addressed by using the theory of NEural COllective INtelligence presented in the next section. We think that the question of collective learning will become one of the principal questions to be solved in the years to come.

Algorithm 1: Q-learning [Watkins89]

  Initialize:
    the Q-values $Q_{(x_i,a_{x_i})}(0)$ arbitrarily for all states $x_i \in X$ and all actions $a_{x_i} \in A_{x_i}$
    the strategy π (e.g. ε-greedy)
    t = 0
    α, γ
  repeat (for every episode)
    Initialize t = 0 and the state x(t)
    repeat (for every step t of the episode)
      Choose an action $a_{x(t)}$ from the state x(t) using the strategy derived from Q: $a_{x(t)} = \mu^{\pi}_{x(t)}$
      Take the action $a_{x(t)}$, observe the reward r(t + 1) and the next state x(t + 1)
      $Q_{(x(t),a_{x(t)})}(t+1) = Q_{(x(t),a_{x(t)})}(t) + \alpha\left[ r(t+1) + \gamma \min_{a_{x(t+1)}} Q_{(x(t+1),a_{x(t+1)})}(t+1) - Q_{(x(t),a_{x(t)})}(t) \right]$
      t = t + 1
      x(t) = x(t + 1)
    until x(t) is the terminal state
  until the end of the episodes
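To complement Algorithm 1, here is a minimal tabular Q-learning sketch in Python; the environment interface (reset(), step(), actions()) is a hypothetical stand-in introduced only for illustration, and, as in Equation 8, costs are minimized.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy strategy (cost minimization).

    `env` is assumed to expose reset() -> state, step(a) -> (next_state, cost, done)
    and actions(state) -> list of actions; this interface is illustrative only.
    """
    Q = defaultdict(float)  # Q[(state, action)]

    def greedy(state):
        # action with the lowest estimated cost in the current state
        return min(env.actions(state), key=lambda a: Q[(state, a)])

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            action = random.choice(env.actions(state)) if random.random() < epsilon else greedy(state)
            next_state, cost, done = env.step(action)
            # update rule of Equation 8
            best_next = 0.0 if done else Q[(next_state, greedy(next_state))]
            Q[(state, action)] += alpha * (cost + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```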

V. NEURAL COLLECTIVE INTELLIGENCE (NECOIN) THEORY

A. Limitations of Artificial Intelligence and Machine Learning for the control of distributed systems

Currently, the theory and the techniques of control of distributed systems call largely upon the methods of AI

and Machine Learning. However, as was explained in the previous sections, the use of these methods in an isolated

way does not make it possible to solve all the problems that arise in distributed systems.

For example, the field of RL provides methods to address several problems which arise in the field of Distributed

Systems. As RL allows online learning without requiring knowledge of the model of the environment, it is well

adapted to address real problems where the correct action for the agent is not known in advance. Hence, the only

way of knowing the correct action is by interaction with the environment via rewards/punishments.

However, a simple application of the algorithms of RL is not enough to control a large distributed system. In


theory, if the nature of the environment were to allow the use of centralized communication and control (not a very probable case in the real world), in practice it would be difficult for a single processor to manage the system due to the size of the action space. Moreover, under the constraint of decentralized communication and control, the question arises of how to modify an isolated RL algorithm so that it contributes to the control of the whole system.

B. NEural COllective INtelligence (NECOIN)

The NEural COllective INtelligence (NECOIN) theory addresses the last mentioned question. A NECOIN can be defined as a large Multi-Agent System (MAS) [Ferber97], [Weiss99], [Wolpert et al.99], [RM02] where:

a. There is a global task to perform: The problem is to determine how to configure the system to achieve this

task.

b. Many agents (or processors) exist, functioning simultaneously, which carry out local actions that affect one

another’s behavior.

c. One of the objectives in the design of a NECOIN is the decentralization of communication and/or control.

d. There is a global utility function (GUF) which makes it possible to measure the overall performance of the

system by observing the local behaviors of each agent. In most cases, the determination of this function is

not easy and its evaluation induces a communication cost.

e. The individual agents carry out algorithms of “Reinforcement Learning” (RL) [Sutton et al.98] and have a

local utility function (LUF). The use of RL enables the agent to adapt and modify its behavior.

f. The individual agents use “Neural Networks” to assist the learning process and to store the knowledge obtained

by the agent interaction.

g. One of the objectives is to simplify the algorithms used by the agents (or processors) for utilization in real-time

applications.

In a NEural COllective INtelligence the learning process consists of adapting the “local behavior” with the aim

of optimizing a given “global behavior” and must be accomplished at two levels:

• Local learning (“micro-learning”): It depends only on the individual parameters of each agent.

• Global learning (“macro-learning”): It emerges from various local behaviors through cooperation.

The problem of an individual “learning-agent” is to know what action must be performed to maximize its local

utility function. The problem becomes more complex when several agents act and must make their decisions with

the purpose of maximizing a GUF.

The central question is: how to fix the individual local utility functions (LUF) of the agents in order to maximize

the global utility function (GUF)?

In order to address the above-mentioned question, three mechanisms based on the COllective INtelligence theory

are proposed:

a. ANTS algorithms:

The behavior of the ants which makes it possible to find the shortest path between food and the ant-hill

is the principal source of inspiration for the development of certain optimization algorithms. The lessons


learned by observing colonies of insects were applied to the task of scheduling mail in a company, to solving the “traveling salesman” problem, where it is necessary to find the shortest path while passing only once through a given number of cities, to the routing of packets in a telecommunication network [Dorigo et al.98], [Schoonderwoerd96], and to the detection of computer attacks in a distributed network environment [Foukia et al.02], as well as in other applications. We will apply this theory to the diffusion of information about the environment of the supply chain network. This technique can help to accelerate knowledge of the environment changes.

b. Planning RL algorithms:

The exploration of the environment (the execution of actions with unknown results) can involve significant performance losses in the MAS. We develop, at the local level of each agent, a mechanism for planning (within the meaning of this term in the RL field). This mechanism consists of sending an update-message every ε_update seconds. This update-message requests the estimates from all adjacent agents known at a given time. Ideally, periods of low utilization would be exploited to carry out exploration. This mechanism helps the decision-making process by anticipating possible future supply chain

scenarios.

c. Punishment mechanisms to modify the local utility functions:

Each resource has a limited capacity, which is the maximum usage that it can support. Once this limited

capacity is exceeded, an additional use will degrade the benefit to all users who, in their turn, will enter into

a cycle of competition where they will strive to take the use of the resource from other users. This excessive

use will lead to damage of the resource. This phenomenon is known as the ’Tragedy of the Commons (TOC)’

(see [Hardin68],[Turner93]).

Limiting individual freedom is the only means that, according to Hardin [Hardin68], can help to

avoid the TOC. The application of taxes based on the use of the resource is one of the ways in which this

tragedy can be avoided in the real world. For example, the tax on fuel leads individuals to use public transport.

In other words, there is a punishment proportional to the use of the resource. In the MAS, the manner of

imposing this type of punishment on the agents is not obvious. In the case of RL, one of the major problems

is to know how much the resource is shared. The agent should be punished only if the resource is used by

other agents.

In the proposed RL Supply Chain Management framework, each agent (department or element of the supply

chain) will execute its actions taking into account its own interests. Thus, a punishment must be assigned for any action that does not contribute to the overall business performance. This punishment is assigned to each agent via a punishment mechanism developed in this work.

In the next section, it is explained how the NECOIN theory is applied in order to address a case study proposed

within the framework of the SCM problem.


VI. NECOIN AS A FRAMEWORK FOR THE SUPPLY CHAIN MANAGEMENT

A. Basic definitions

This work proposes a model of the supply chain within the framework of the NECOIN theory. In our approach,

an agent can represent any entity of the supply chain like sales agent, warehouse agent, production agent, etc. The

internal departments can be modeled as sub-agents as proposed in [Julka et al.02a]. The materials in the supply

chain are represented as objects that are part of the environment. Therefore, every agent can change or influence

these environment objects. The details of the objects are stored as attributes. For the fragment of the production

supply chain discussed in more detail in the 'Case Study and Experimental Results' Section, we define the following elements of the NECOIN framework for the SCM problem:

• An order-agent that has knowledge of the final product orders: PO

• Set of n machine-agents: M = {M1, M2, M3, . . . , Mn}

• Set of s operations executed by machine i: OPi = {O1, . . . , Os}

• Vector of non-negative values of r features for each operation $O_i$: $\vec{V}_i = \langle v^i_1, \ldots, v^i_r \rangle$, e.g. $v^i_1$ = average time. Note: the features vary from one machine to another

• Set of m storage-agents denoting raw material providers: $S = \{S_1, S_2, S_3, \ldots, S_m\}$

• Set of s objects corresponding to a type of raw material: MP = {MP1, . . . , MPs}

• Set of n final product storages: FP = {FP1, . . . , FPn}

• Set of n objects corresponding to a type of final product: P = {P1, . . . , Pn}

• Vector of non-negative values of r features for each product $P_i$: $\vec{PV}_i = \langle pv^i_1, \ldots, pv^i_r \rangle$, e.g. $pv^i_1$ = product priority

In this work, each agent has the following features (adapted from [Swaminathan et al.98] and [Weiss99]):

• The set of environment states is represented as follows: X = {x1, x2, x3, . . . }. For example, the current product

inventory, shipment of an order to a customer, etc. The knowledge about other agents is considered to be part

of the environment state. Due to the distributed nature of the environment, an agent typically only has an

incomplete view of the state or actions of other agents. For example, in some cases a production agent might

make decisions without the knowledge that a supplier has frequently defaulted on due dates.

• The capacity of an agent to act in a state x(t) = i is represented as the set of actions: Ai = {a1, a2, a3, . . . , ak}.

• An agent can be represented as a function Action : $X^* \rightarrow A_i$ which maps sequences of environment states $X^*$ to actions $A_i$. An agent decides the action to perform by taking into account its experiences up to the present moment. These experiences are represented as the sequence

of visited states of the environment, and the actions carried out by the agent.

• The interaction of the agent with its environment is represented as a history of events:
$h: x(0) \xrightarrow{a(0)} x(1) \xrightarrow{a(1)} x(2) \xrightarrow{a(2)} x(3) \ldots$
where x(0) is the initial state of the environment. In the SCM problem, the agents must interact with the environment in a continuous way.


• The relationships between the agents in the supply chain are defined by: R = {r1, r2, r3, . . . }.

In this set of relations, the set of agents that can interact with the considered agent is defined. Thus, for each related agent the following is considered: a) its relationship to this agent (customer, supplier), b) the nature of

the agreement that governs the interaction (production guarantees) and c) the inter-agent information access

rights (the agent’s local state to be considered during the decision-making process).

• The priorities of every agent are represented by: Q = {q1, q2, q3, . . . }. These priorities can help in sequencing

incoming messages for processing.

• The local utility function (LUF) is represented as follows:

$$LUF = Q_{(x(t),a_{x(t)})}(t+1) = Q_{(x(t),a_{x(t)})}(t) + \alpha\left[ r(t+1) + \gamma \min_{a_{x(t+1)}} Q_{(x(t+1),a_{x(t+1)})}(t+1) - Q_{(x(t),a_{x(t)})}(t) \right]$$

where:

– The Q-values $Q_{(x(t),a_{x(t)})}$ give an estimation of the supply chain environment. The way in which the Q-value update is performed can be considered one of the most important problems to solve in our framework.

– The reinforcement for the action performed is represented by r(t + 1).

This equation represents the Q-learning (see [Sutton et al.98] for more details) equation used in the Reinforcement Learning field.

• The set of control elements: C = {c1, c2, c3, . . . }

A control element is invoked when there is a decision to be made while processing a message. For example,

in order to determine the next destination in the transport of materials, a routing-control algorithm would be

utilized.

Every agent has a message handler that is responsible for sending and receiving different messages to facilitate

communication among the agents. The messages’ function is to communicate information, to send requests,

and to communicate environment states.

• The set of incoming messages: I = {i1, i2, i3, . . . }

• The set of outgoing messages: O = {o1, o2, o3, . . . }

• The message processing semantics can be represented as the following function: M(m(i)). The

message m(i) must be processed by the agent, for example, a message from an agent asking for the price of

a product.
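To illustrate how the elements listed above could be grouped in software, the sketch below collects them in one Python class; the field names, the message format, and the dispatch logic are assumptions made for illustration and do not describe the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class SupplyChainAgent:
    """Schematic container for the agent elements of this section (illustrative only)."""
    name: str
    relationships: List[str] = field(default_factory=list)      # R: related agents
    priorities: Dict[str, int] = field(default_factory=dict)    # Q: message priorities
    q_values: Dict[Tuple[str, str], float] = field(default_factory=dict)  # (state, action) estimates
    inbox: List[dict] = field(default_factory=list)              # I: incoming messages
    outbox: List[dict] = field(default_factory=list)             # O: outgoing messages
    controls: Dict[str, Callable] = field(default_factory=dict)  # C: control elements

    def local_utility_update(self, state, action, cost, next_state, next_actions,
                             alpha=0.5, gamma=0.9):
        """LUF update: the Q-learning rule given above, applied to the agent's own estimates."""
        best_next = min((self.q_values.get((next_state, a), 0.0) for a in next_actions),
                        default=0.0)
        old = self.q_values.get((state, action), 0.0)
        self.q_values[(state, action)] = old + alpha * (cost + gamma * best_next - old)

    def handle(self, message: dict):
        """Message handler: store the message and dispatch it to a control element by type."""
        self.inbox.append(message)
        handler = self.controls.get(message.get("type"))
        if handler:
            handler(self, message)
```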

VII. SIMPLIFICATION OF THE Q-NEURAL ALGORITHM FOR JOB ROUTING TASK

In this section, the job routing problem is considered to illustrate the proposed approach to the optimization of

the production supply chain. To address this problem, an adaptation of the Q-neural algorithm is proposed and

described.


Fig. 2. Diagram of the Q-neural algorithm (main blocks: dynamic environment, state, evaluation function, Q-value approximator, update rule, utility function, reinforcement, action selection, reflexes)

A. General remarks

The behavior of the Q-neural algorithm [RM02] (see Fig. 2) was inspired by the operation of the Q-routing algorithm ([Littman et al.93]), the theory of NECOIN and the algorithms based on the behavior of ant colonies. Q-neural is able to adapt to changes in the supply chain environment. It stores an estimate, for each (operation, machine) pair $(O_j, M_i)$, of the cost in time induced by sending the raw material via a neighbor machine-agent $M_i$.

Learning is done at two levels: first at the local level of every agent, through the update of the Q-values using an RL rule, and then at the global level of the whole system, through the adjustment of the local utility functions. In the implementation of Q-neural within the framework of the supply chain problem, and more precisely of job routing, each final product is defined by a technological process, i.e. a set of operations that must be executed for that product.

The control messages allow for knowledge of the environment of the supply chain by updating the Q-values

which are approximated by using a function approximator (look-up table, neural network, etc.). It is thanks to the

control messages that a machine-agent can learn the model of the environment and react to the

possible changes. In Q-neural, there are 5 types of control messages:

a. An ’environment-message’ (flag_ret=1) is a message which is generated by an intermediate machine-agent

after the reception of a raw material, if the time interval ω has already passed.

b. An ’ant-message’ (flag_ret=2) is a message generated by the FP-storage-agent according to the interval of


time ω_ants when a final product arrives at the FP-storage-agent.

c. An 'update-message' (flag_ret=3) is a message generated in the planning phase every ε_update seconds to ask the neighboring machine-agents for their estimates about the operations of the final products.

d. An ’update-back-message’ (flag_ret=4) is a message used in the planning phase in order to accelerate the

knowledge of the environment of the supply chain. This message is generated after the reception of an

update-message.

e. A ’punishment-message’ (flag_ret=5) is a message used to punish a machine-agent using a congested resource.
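A compact way to encode these five message types is shown below; the flag_ret values come from the list above, while the remaining header fields and names are hypothetical and only meant to make the protocol concrete.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class ControlFlag(IntEnum):
    """flag_ret codes of the five Q-neural control messages."""
    ENVIRONMENT = 1   # reply sent back after a raw material is received (if omega has passed)
    ANT = 2           # generated by the FP-storage-agent, carries estimates along the path
    UPDATE = 3        # planning phase: request neighbors' estimates every epsilon_update seconds
    UPDATE_BACK = 4   # reply to an update-message with the sender's current estimates
    PUNISHMENT = 5    # punish a machine-agent that is using a congested resource

@dataclass
class ControlMessage:
    """Minimal control-message container (header fields are illustrative assumptions)."""
    flag_ret: ControlFlag
    source: str                            # sending machine-agent
    destination: str                       # receiving machine-agent
    estimates: Optional[dict] = None       # Q-value estimates carried in the header, if any
    reinforcement: Optional[float] = None  # r(t+1) for environment-messages
```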

B. Initialization of the algorithm

The first step of the algorithm is the initialization of the parameters for each agent in the supply chain network.

Table I presents the principal parameters to be adjusted at the beginning. These parameters are related to the

RL method used, in this case Q-learning, and to the Q-neural algorithm in general.

TABLE I: PARAMETERS TO ADJUST IN THE INITIALIZATION PHASE OF THE Q-NEURAL ALGORITHM

α: Learning rate
ε: Exploration rate
ω: Time interval for the updates of the environment-messages
ω_ants: Time interval for the updates of the ant-messages
ε_update: Time interval for the updates of the update-messages

The learning rate, α, indicates the speed with which the Q-values change, i.e. the importance given to the present

moment. The exploration rate indicates the percentage (0 ≤ ε < 100) of time during which the machine-agent will

choose a random action. Each machine-agent, when its operation has just started, does not have information about

the environment. It must, however, explore its environment in an intelligent way as soon as it has raw materials to

send.

The parameter ω indicates the period of time which will have to pass before the sending of an environment-

message. The parameter ω_ants represents the period of time before an ant is created and sent back by the FP-

storage-agent. Finally, ε_update represents the interval of time for the updates of the update-messages.

C. Evaluation function

Evaluation of the input vector: From the input vector $\vec{x}(t)$ obtained from the raw material header and environment variables, the Q-value is obtained from the function approximator. For example, if a raw material which corresponds to the product $P_i$ and the technological process $\{O_1, O_3, O_4\}$ is received by the machine-agent $M_i$, the Q-value for every machine-agent $M_y$ which can execute the operation $O_3$ is obtained as follows:


$$Q^{\mu}_{(\vec{x}(t),\,M_y)}(t) \qquad (9)$$

Exploitation: For the phase of exploitation, it is necessary to choose the action (or machine-agent) with the

lowest Q-value stored in the function approximator. If the machine-agent $M_i$ must decide by which route to send the raw material, it sends it via the neighbor which, in its opinion, represents the shortest time for the raw

material to arrive at the FP-storage-agent.

If the machine-agent $M_i$ has 4 neighbors, for example, the 4 Q-values

$$Q^{\mu}_{(\vec{x}(t),M_1)}(t),\; Q^{\mu}_{(\vec{x}(t),M_2)}(t),\; Q^{\mu}_{(\vec{x}(t),M_3)}(t),\; Q^{\mu}_{(\vec{x}(t),M_4)}(t)$$

are compared to choose the action of sending the raw material by a neighbor machine-agent.
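A minimal sketch of this exploitation step (combined with ε-greedy exploration) is given below in Python, assuming the Q-values are kept in a dictionary keyed by (input vector, neighbor machine); the data layout is an assumption for illustration.

```python
import random

def choose_next_machine(q_values, input_vector, neighbors, epsilon=0.08):
    """Pick the neighbor machine-agent with the lowest estimated time to the
    FP-storage-agent, exploring a random neighbor with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(neighbors)            # exploration
    # exploitation: smallest Q-value among the neighbors (unknown neighbors treated as worst)
    return min(neighbors, key=lambda m: q_values.get((input_vector, m), float('inf')))
```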

D. Planning as exploration

Exploration can involve significant loss of time. In Q-neural, a mechanism of planning (within the meaning of this

term in RL) was developed at the local level of each agent. This mechanism consists of sending (see Algorithm 2) an update-message every ε_update seconds. This update-message, which carries the variable flag_ret = 3 in its header to differentiate it from other messages, asks for the Q-value estimates of all the products known at that moment by the neighbors. Ideally, periods of low production would be exploited to carry out exploration.

Algorithm 2: Planning algorithm in “Q-neural”

  Every ε_update seconds:
    Send an update-message (flag_ret = 3) to all the neighbors to ask for their estimates of all the known products
  if an update-message is received (flag_ret = 3):
    Send an update-back-message (flag_ret = 4) to the source machine-agent of the update-message with the estimates $Q^{\mu}_{(\vec{x}(t),a_{\vec{x}(t)})}(t)$ of all the known products at instant t
  if an update-back-message is received (flag_ret = 4):
    Update the Q-value in the same way as is used for an environment-message (flag_ret = 1)

E. Environment of the supply chain

The environment model of an agent can be defined as all that helps the agent to make good decisions.


Environment-messages: Once the raw material is sent from the machine-agent Mx to the machine-agent My,

the machine-agent $M_x$ will receive from the environment (in this case the selected machine-agent $M_y$) an

environment-message which makes it possible to update the action selection policy. However, in our experiments,

we observed that it is not practical to receive an environment-message each time, since this contributes to congestion

of the communication channel. Moreover, the cost does not change significantly between the sending of two raw

materials. Thus, the parameter called ω was added and indicates the interval of time which must pass before an

environment-message is sent in reply.

Ants algorithm: When an environment-message arrives at the FP-storage-agent, an ant is sent in return if the period of time ω_ants has already passed. This ant exchanges all the statistics obtained on its way. If it arrives at the source, it dies. The ant updates the Q-value of each machine-agent through which the raw material passed before arriving at the FP-storage-agent. This is shown in Algorithm 3.

Algorithm 3: Ant-message algorithm

  if a product arrives at the FP-storage-agent then
    the FP-storage-agent generates an ant-message if the time ω_ants has passed
  if a machine-agent $M_i$ receives the ant-message from the neighbor machine-agent $M_y$ or the FP-storage-agent then
    Read the estimate $Q^{A}_{(\vec{x}(t),a_{\vec{x}(t)})}(t)$ from the header of the ant-message (flag_ret = 2)
    Consult the best estimate at the present time $Q^{M}_{(\vec{x}(t),a_{\vec{x}(t)})}(t)$
    if $Q^{M}_{(\vec{x}(t),a_{\vec{x}(t)})}(t) > Q^{A}_{(\vec{x}(t),a_{\vec{x}(t)})}(t)$ and a cycle is not detected then
      Update the Q-value by using $Q^{A}_{(\vec{x}(t),a_{\vec{x}(t)})}(t)$
    else
      Do not update
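The ant update can be sketched as follows in Python; the `path` field used for cycle detection is an assumed header field, since the paper mentions cycle detection but not its exact mechanism.

```python
def apply_ant_message(q_values, key, ant_estimate, path, current_agent):
    """Handle an ant-message as in Algorithm 3: adopt the ant's estimate only if it is
    better than the current one and the ant has not already passed through this agent."""
    cycle_detected = current_agent in path
    current_estimate = q_values.get(key, float('inf'))
    if ant_estimate < current_estimate and not cycle_detected:
        q_values[key] = ant_estimate      # the ant carries a better estimate: update
    path.append(current_agent)            # record the visit before forwarding the ant
    return q_values
```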

Update-messages: As explained in Section VII-D, the function of an update-message is to update the machine-agent's Q-value estimates of its neighbors by using the planning mechanism developed.

F. Reinforcement function

The reinforcement function represents the production time of a product between two machine-agents $M_x$ and $M_y$. It is defined as follows:

$$r(t+1) = tt_{x,y} + tq_{x,y} + t_o \qquad (10)$$


a. $tt_{x,y}$: transit time in the line between the machine-agents $M_x$ and $M_y$.
b. $tq_{x,y}$: time spent by the raw material in the queue of the line between the machine-agents $M_x$ and $M_y$.
c. $t_o$: operation time in the machine $M_y$.

G. Punishment mechanisms

Fig. 3. Conflicts between two or several machine-agents (machine-agents $M_{x_1}$, $M_{x_2}$, $M_{x_3}$, $M_{y_1}$, $M_{y_2}$, $M_{z_1}$)

Figure 3 presents a situation in which cooperation between several machine-agents is necessary. Let us assume that each machine-agent $M_{x_1}$, $M_{x_2}$, and $M_{x_3}$ generates processed raw material at a constant rate. Also let us assume that for these machine-agents the best estimate is obtained by passing through $M_{y_1}$, and for the machine-agent $M_{y_1}$ by passing through $M_{z_1}$. If each machine-agent acts in a greedy way, congestion can occur in the queue of the line $l_{y_1,z_1}$ as a result of the greedy behavior of the machine-agents.

In the case of congestion, it can be noted that the machine-agents $M_{x_1}$ and $M_{x_2}$ do not have any choice to

make. On the other hand, if the machine-agent Mx3 sacrifices its individual utility, or rather, if a means of punishing

the use of the line lx3,y1 by this machine-agent is applied, and it is forced to use the line lx3,y2, this will help to

decrease the congestion.

In order to address this problem, a communication protocol between the machine-agents and an algorithm to assign a punishment to the machine-agents were developed. This algorithm is presented as Algorithm 4.

For the explanation of this algorithm, Figure 3 will be used. Let us analyze the case where congestion arises in the queue of the line $l_{y_1,z_1}$. In this case, the machine-agent $M_{z_1}$ detects it and sends a message to $M_{y_1}$.


After the reception of the message, the machine-agent My1 calculates the second best estimate. It then sends a

message to all the neighbors Mx1, Mx2, Mx3 which are sending raw materials. Then, it receives the second best

estimate of each neighbor Mxi. Lastly, it chooses the best estimate among the received estimates of the neighbors

as follows:

$$\arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y_2)}(t) \qquad (11)$$

If the machine-agent $M_{y_1}$ has the second best estimate, $Q^{M_{y_1}}_{(\vec{x}(t),\,y_2)}(t)$, and the best estimate of the neighbors, $\arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y_2)}(t)$, also exists, then, if $Q^{M_{y_1}}_{(\vec{x}(t),\,y_2)}(t)$ is lower than $\arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y_2)}(t)$, the line $l_{y_1,z_1}$ will be punished as follows:

$$Q^{M_{y_1}}_{(\vec{x}(t),\,z_1)}(t+1) = Q^{M_{y_1}}_{(\vec{x}(t),\,y_2)}(t) + \Delta \qquad (12)$$

In Figure 3, this obliges $M_{y_1}$ to use the line $l_{y_1,y_2}$. On the other hand, if $Q^{M_{y_1}}_{(\vec{x}(t),\,y_2)}(t)$ is higher than $\arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y_2)}(t)$, a punishment-message is sent. If the best estimate of the neighbors, $\arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y_2)}(t)$, does not exist, the line $l_{y_1,z_1}$ will be punished directly as shown before.

If the second best estimate, $Q^{M_{y_1}}_{(\vec{x}(t),\,y_2)}(t)$, does not exist, and if the best estimate of the neighbors $\arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y_2)}(t)$ exists, a punishment-message is sent.

Finally, when a punishment-message is received, the line $l_{x_i,y_1}$ will be punished.

H. Update function

The Q-value update of a machine-agent at the instant t + 1 is made from the information in the headers of the

raw materials and environment state. The update rule of the Q-neural algorithm is based on Q-learning and it is

represented as follows:

$$Q_{(\vec{x}(t),a_{\vec{x}(t)})}(t+1) = Q_{(\vec{x}(t),a_{\vec{x}(t)})}(t) + \alpha\left[ r(t+1) + \gamma \min_{a_{\vec{x}(t+1)}} Q_{(\vec{x}(t+1),a_{\vec{x}(t+1)})}(t+1) - Q_{(\vec{x}(t),a_{\vec{x}(t)})}(t) \right] \qquad (13)$$

a. $r(t+1) = tt_{x,y} + tq_{x,y} + t_o$
b. $\min_{a_{\vec{x}(t+1)}} Q_{(\vec{x}(t+1),a_{\vec{x}(t+1)})}(t+1)$: the best estimate for the raw material to arrive at the FP-storage-agent from the point of view of the next machine-agent $M_{y'}$ at the instant t + 1.
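Putting Equations 10 and 13 together, one update step of a machine-agent's estimate could look like the Python sketch below; the dictionary layout and the value of γ are illustrative assumptions, while α = 0.8 is the value used later in the case study.

```python
def q_neural_update(q_values, state, machine, transit_t, queue_t, operation_t,
                    next_state, next_machines, alpha=0.8, gamma=0.9):
    """One Q-neural update (Equation 13) using the reinforcement of Equation 10.

    q_values[(state, machine)] is the estimated time to reach the FP-storage-agent
    when the raw material is sent via `machine`."""
    r = transit_t + queue_t + operation_t          # r(t+1) = tt_xy + tq_xy + t_o
    best_next = min((q_values.get((next_state, m), 0.0) for m in next_machines),
                    default=0.0)                   # best estimate reported by the next machine
    old = q_values.get((state, machine), 0.0)
    q_values[(state, machine)] = old + alpha * (r + gamma * best_next - old)
    return q_values[(state, machine)]
```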

Algorithm 5 describes the operation of “Q-neural” within the framework of the SCM problem. The Q-neural algorithm is executed in parallel and simultaneously with the other algorithms presented: the planning algorithm, the punishment algorithm, and the ant algorithm.

Algorithm 4: Punishment algorithm

  if a punishment-message is received:
    Compute the second best estimate $Q^{M_y}_{(\vec{x}(t),\,z')}(t)$ to arrive at the FP-storage-agent by using a line that is not $l_{y,z}$
    Send a message to all the neighbor machine-agents $M_{x_i}$ that are producing raw materials
    Receive the second best estimate of every neighbor $M_{x_i}$
    Select the best estimate among all the estimates of the neighbors: $\arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y')}(t)$
    if the second best estimate $Q^{M_y}_{(\vec{x}(t),\,z')}(t)$ exists:
      if the best estimate $\arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y')}(t)$ of the neighbors exists:
        if $Q^{M_y}_{(\vec{x}(t),\,z')}(t) < \arg\min Q^{M_{x_i}}_{(\vec{x}(t),\,y')}(t)$:
          Punish the line $l_{y,z}$: $Q^{M_y}_{(\vec{x}(t),\,z)} =$ second best estimate $Q^{M_y}_{(\vec{x}(t),\,z')}(t) + \Delta$
        else:
          Punish the line $l_{x_i,y}$: send a punishment-message
      else:
        Punish the line $l_{y,z}$: $Q^{M_y}_{(\vec{x}(t),\,z)} =$ second best estimate $Q^{M_y}_{(\vec{x}(t),\,z')}(t) + \Delta$
    else:
      if the second best estimate $\min Q^{M_{x_i}}_{(\vec{x}(t),\,y')}(t)$ of the neighbors exists:
        Punish the line $l_{x_i,y}$: $Q^{M_{x_i}}_{(\vec{x}(t),\,y)}(t) =$ second best estimate $Q^{M_{x_i}}_{(\vec{x}(t),\,y')}(t) + \Delta$
        Send a punishment-message with the estimate

VIII. CASE STUDY AND EXPERIMENTAL RESULTS

In this section we present an example of some of the performance analysis that has resulted from the model described in this paper to investigate the impact of the dynamic job routing decisions on the overall system performance and adaptability against disturbances. In our experimental models, routing flexibility is introduced into

the production system by providing jobs with a flexible processing order for their operations. That is, there is no

technological constraint on the processing sequence of the operations of the jobs.


Algorithm 5: Q-neural

  1. Initialize at t = 0:
     - all the Q-values $Q_{(x(t),a_{x(t)})}$ with high values
     - the RL parameters: α, γ, ε (exploration), ω, ω_ants
  REPEAT: update the instant t
    IF a raw material is received by machine $M_i$:
      Read the input vector $\vec{x}$ from the raw material header and environment variables
      Send a message to the machine or agent from which the raw material arrived, $M_x$, with the value of the reinforcement function r(t+1) and the estimate $Q^{\mu}_{(\vec{x}(t),a_{\vec{x}(t)})}(t)$
      Execute the operation O′ and choose the action $a_{\vec{x}(t)} = M'$ as a function of the input vector $\vec{x}$, using the ε-greedy strategy derived from $Q_{(\vec{x}(t),a_{\vec{x}(t)})}(t)$
      Send the raw material to $a_{\vec{x}(t)} = M'$
      At the next time step, receive the message from the machine M′ with the value of the reinforcement function r(t+1) and the estimate $Q_{(\vec{x}(t+1),a_{\vec{x}(t+1)})}(t+1)$
      Apply the Q-learning update rule:
      $Q_{(\vec{x}(t),a_{\vec{x}(t)})}(t+1) = Q_{(\vec{x}(t),a_{\vec{x}(t)})}(t) + \alpha\left[ r(t+1) + \gamma \min_{a_{\vec{x}(t+1)}} Q_{(\vec{x}(t+1),a_{\vec{x}(t+1)})}(t+1) - Q_{(\vec{x}(t),a_{\vec{x}(t)})}(t) \right]$
  REPEAT (in parallel):
    Planning algorithm (every ε_update)
    Punishment algorithm
    Ant algorithm

A. Experimental model implementation

The model described below has been implemented using the Netlogo simulator⁴ in order to test the performance of the developed algorithms. The layout of the simulated production SC is shown in Figure 4. According to the

model defined in Section VI, the first tier consists of suppliers of raw materials. In the simulation model, this

tier is represented by the central storage. Raw materials are distributed among the processing units (machines)

organized into several different tiers. We consider that the operation lists corresponding to each machine of the tier

are identical. Finally, all the processed parts are stored at the storage of the final products.

In the implemented system, the machine agents Mi are considered as the most important agents within the

⁴Netlogo is a cross-platform agent-based parallel modeling and simulation environment. For more details see http://ccl.northwestern.edu/netlogo.


Fig. 4. Layout of the production Supply Chain case study (a screen-shot of the Netlogo simulator; elements shown: raw materials storage, machines processing raw materials, transition buffers, final product storage)

production chain, since they make decisions on the job routing sequence implementing the routing algorithms. The

first process that they execute at the beginning of the simulation is the initialization of the values of the so-called neighborhood tables. By neighbors we refer to those agents that belong to the next production tier. In these tables they maintain registers of all operations that can be carried out in the next tier, as well as the machine agents of that tier that can implement them, and the operation processing times. The machine that can carry out the next operation fastest is selected. Based on this information, decisions on the job routing sequence are taken locally, but the decision-making strategy is constantly improved by applying the Q-neural algorithm. As mentioned in the previous sections, this algorithm considers the following parameters for the evaluation of the learning function:

• Transition time, the time that it takes the raw material or intermediate product object to arrive at its destination. Although it is proposed that the transport system be modeled as a set of serial transporters (AGVs), in our model the transportation time of each part is calculated based on the transportation speed and the machine layout in order to simplify the system.

• Waiting time, the time that the RM objects spend in the waiting line of the corresponding buffer.

• Processing time, the time that it takes the machine to carry out the operation.

To enable the adaptation scheme, each time processing is over, a message is sent to the previous machine agent so that it can update its table with the information on the total time (processing, waiting, and transition) spent at each step of the production process.
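The neighborhood table and this feedback message can be pictured with the short Python sketch below. The stored fields (operations, processing times, and the learned total time composed of processing, waiting, and transition) follow the description above; the class names and the exponential smoothing of the stored estimate are illustrative assumptions, not the authors' NetLogo code.

    from dataclasses import dataclass, field

    @dataclass
    class NeighborEntry:
        """One register of the neighborhood table: what a next-tier machine offers."""
        operations: set                     # operations the neighbor can carry out
        processing_time: dict               # nominal duration per operation
        estimated_total_time: dict = field(default_factory=dict)  # learned totals per operation

    @dataclass
    class NeighborhoodTable:
        entries: dict                       # neighbor machine name -> NeighborEntry

        def fastest_machine(self, operation):
            """Select the neighbor expected to finish the given operation soonest."""
            candidates = {name: e for name, e in self.entries.items() if operation in e.operations}
            return min(candidates,
                       key=lambda name: candidates[name].estimated_total_time.get(
                           operation, candidates[name].processing_time[operation]))

        def report_back(self, machine, operation, processing, waiting, transition, smoothing=0.5):
            """Feedback message sent after processing is over: update the stored total time
            (processing + waiting + transition) with simple exponential smoothing (assumed)."""
            total = processing + waiting + transition
            entry = self.entries[machine]
            old = entry.estimated_total_time.get(operation, total)
            entry.estimated_total_time[operation] = (1 - smoothing) * old + smoothing * total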

As mentioned, each machine agent M_i has an input buffer in which the raw materials and intermediate products are stored. The content of this buffer varies constantly as a function of the production flow and of the duration of the operations on each of the elements in the buffer. It is important to notice that, besides the input buffer (a real one), each machine has an output buffer (a virtual one) that represents the possible routing alternatives for the next operation. This buffer is used to choose the best alternative for the product flow by means of a negotiation process between the current machine agent and those of the next tier.
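The virtual output buffer can be seen as the set of quotes gathered from the next tier before a part is released. The sketch below illustrates one possible form of this negotiation, under the assumption that each candidate machine answers with its expected waiting plus processing time; the function and method names are hypothetical.

    def negotiate_next_machine(part_operation, candidates):
        """Ask every next-tier candidate for a quote and keep the routing alternatives
        in a 'virtual output buffer' before choosing the best one (illustrative only)."""
        virtual_output_buffer = {}
        for machine in candidates:
            if machine.can_process(part_operation):
                # quote = expected waiting time in the input buffer + processing time (assumed)
                virtual_output_buffer[machine] = machine.quote(part_operation)
        # the part is routed to the machine with the smallest quoted completion time
        return min(virtual_output_buffer, key=virtual_output_buffer.get)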

If the current machine belongs to the final tier of the production SC, the final product is sent to the FP storage agent, which maintains the inventory of all the products successfully processed by the system.

B. Case study description and performance results

In the previous sub-section, we have shown how the different agents interact with each other to carry out the control of the production processes. We distribute agents and product objects over the SC network and have them interact as described above, thus simulating the communication and cooperation of the actual controllers distributed in a production plant. In this section we present an example of the performance analysis carried out with the model described in this paper, which demonstrates the interaction model of the agents and investigates the impact of dynamic job routing decisions on the overall system performance and on its adaptability to disturbances.

Suppose we have to produce two products, P1 and P2, under the following production scheme:

P1 = {O1, O2, O3}

P2 = {O1, O2}

The initial demand consists of five units of P1 and one unit of P2. The raw materials storage S1 is connected to the machines of the first tier {M1, M2, M3}. For simplicity, the second production tier is composed of machines {M4, M5, M6}, and the final tier of machines {M7, M8, M9} connected to the storage of final products FP. These characteristics are defined in Table II.

To update the machine-agent tables, the Q-neural algorithm was implemented. The planning and punishment sub-algorithms were applied every second of simulation time, and the ant sub-algorithm every 0.5 s. The general parameters were: learning rate α = 0.8 and exploration rate ε = 0.08.
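For concreteness, the case-study setup of Table II and the experiment parameters can be written down as plain data, as in the illustrative Python snippet below (the dictionary layout is ours, not part of the simulator).

    # Production schemes: ordered operation lists of the two products
    PRODUCTS = {
        "P1": ["O1", "O2", "O3"],
        "P2": ["O1", "O2"],
    }
    INITIAL_DEMAND = {"P1": 5, "P2": 1}

    # Machines: operations with durations (seconds) and next-tier acquaintances (Table II)
    MACHINES = {
        "M1": {"ops": {"O1": 0.5}, "acquaintances": ["M4", "M5", "M6"]},
        "M2": {"ops": {"O1": 0.5}, "acquaintances": ["M4", "M5", "M6"]},
        "M3": {"ops": {"O1": 0.5}, "acquaintances": ["M4", "M5", "M6"]},
        "M4": {"ops": {"O2": 0.5}, "acquaintances": ["M7", "M8", "M9", "FP"]},
        "M5": {"ops": {"O2": 0.9, "O3": 0.5}, "acquaintances": ["M7", "M8", "M9", "FP"]},
        "M6": {"ops": {"O2": 0.5}, "acquaintances": ["M7", "M8", "M9", "FP"]},
        "M7": {"ops": {"O3": 0.5}, "acquaintances": ["FP"]},
        "M8": {"ops": {"O3": 0.5}, "acquaintances": ["FP"]},
        "M9": {"ops": {"O3": 0.5}, "acquaintances": ["FP"]},
    }

    # Learning parameters and sub-algorithm periods used in the experiments
    ALPHA, EPSILON = 0.8, 0.08                  # learning rate, exploration rate
    PLANNING_PERIOD = PUNISHMENT_PERIOD = 1.0   # seconds of simulation time
    ANT_PERIOD = 0.5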

At the second stage of the experiments, an adaptation of the Q-routing algorithm ([Littman et al.93]) to the SC problem was compared with the Q-neural algorithm described in this paper. The comparison of the two algorithms is shown in Fig. 5, which plots the number of products produced (i.e., arrived at the FP storage) against the average production time.


TABLE II
CHARACTERISTICS OF THE CASE STUDY PRODUCTION SC

Machine   Operations   Operation duration       Acquaintances
M1        {O1}         O1 = 0.5 s               {M4, M5, M6}
M2        {O1}         O1 = 0.5 s               {M4, M5, M6}
M3        {O1}         O1 = 0.5 s               {M4, M5, M6}
M4        {O2}         O2 = 0.5 s               {M7, M8, M9, FP}
M5        {O2, O3}     O2 = 0.9 s, O3 = 0.5 s   {M7, M8, M9, FP}
M6        {O2}         O2 = 0.5 s               {M7, M8, M9, FP}
M7        {O3}         O3 = 0.5 s               {FP}
M8        {O3}         O3 = 0.5 s               {FP}
M9        {O3}         O3 = 0.5 s               {FP}

Fig. 5. Comparison of the results of the Q-routing and Q-neural algorithms

The case study results shown in Fig. 5 highlight the adaptability and good performance of Q-neural. Machine agents based on Q-routing make their decisions greedily, optimizing only their local utility functions. This leads to buffer saturation, and the global utility function decreases as a result of this greedy behavior. In contrast, machine agents based on Q-neural make their decisions taking into account the global supply chain performance, not only the local level. As a result, the performance of the supply chain scenario improves thanks to the adaptation to changes in the supply chain environment.
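The difference between the two behaviors can be sketched as follows: a Q-routing agent always picks the neighbor with the smallest local time estimate, while a Q-neural agent combines exploration with information about downstream congestion. The congestion penalty in the Python sketch below is only an illustrative stand-in for the global-utility feedback that Q-neural obtains through its planning, punishment, and ant sub-algorithms; names and weights are assumptions.

    import random

    def greedy_choice(q_values, neighbors, state):
        """Q-routing style decision: minimize the local time estimate only."""
        return min(neighbors, key=lambda m: q_values[(state, m)])

    def global_aware_choice(q_values, neighbors, state, buffer_load, capacity,
                            epsilon=0.08, congestion_weight=1.0):
        """Illustrative Q-neural style decision: occasional exploration plus a
        congestion penalty that discourages saturating downstream buffers."""
        if random.random() < epsilon:
            return random.choice(neighbors)
        def cost(m):
            saturation = buffer_load[m] / capacity[m]   # 0.0 (empty) .. 1.0 (full)
            return q_values[(state, m)] + congestion_weight * saturation
        return min(neighbors, key=cost)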

IX. CONCLUSIONS AND FUTURE WORK

Today's challenge is to optimize the overall business performance of the modern enterprise. In general, the limitations of traditional approaches to solving Supply Chain Management problems include the fact that the supply network model does not correspond to reality, because of incomplete information, complex dynamic interactions between the elements, and the need to centralize control elements and information.

In this paper, the Supply Chain Management problem is addressed within the framework of the NECOIN theory. In order to optimize the global behavior of the chain, a learning process based on RL algorithms is used to adapt the local behaviors of the SC elements. The model is implemented in the agent-based parallel modeling and simulation environment of the NetLogo platform. In our previous work ([Chandra et al.02]), we argued that agent technology for solving the problem of knowledge integration marks a future trend for developing virtual enterprises in general, and SCM systems in particular. The open nature of the MAS is provided by an agent organization similar to that of the distributed enterprise, supported by an agent platform responsible for providing flexibility both in the aggregation of components and in the interaction between them. This scheme breaks away from traditional organization structures, such as hierarchical or matrix structures, and has the distinguishing feature of taking into account the convenience for an enterprise to participate in groups, while allowing fast adaptation in response to enterprise dynamics. The fact that agents make decisions about the nature and scope of their interactions with other components at runtime makes the engineering of complex systems easier.

Being the aggregating centre of the SC information infrastructure, an agent platform (AP) can also serve as an experimental test-bed for the implementation of the models developed in this paper. This makes it easy to carry the algorithms tested in the simulated environment over to real-world applications.

We conclude that the SCM problem is well suited to the application of the NECOIN theory. In addition, the adaptive algorithm presented, Q-neural, provides better throughput and reliability than other algorithms that suffer from problems like the "Tragedy of the Commons".

At the moment, we are working on a refinery supply chain simulator. The use of simulation as a means of understanding the issues of organizational decision making has gained considerable attention in recent years [Swaminathan et al.98]. The Internet is used as the communication infrastructure for the agents. Another challenge is the ERP (SAP) data integration required to recover all the relevant information that can lead to better agent decision making. In addition, we are working on the control elements of the different agents (departments). For example, for the pipeline network we are working on a crude routing algorithm that optimizes the transportation time. In future work, an adapted model of the CMAC Neural Network ([RM02]) will be used for the approximation of the Q-values, and a more sophisticated punishment algorithm will be developed to adjust the local utility functions. We also intend to compare our algorithms with other classical optimization methods.

Current research in multi-agent heterarchical control systems usually implements part-driven real-time scheduling algorithms, where part agents use auction-based resource reservation protocols to explore the routing or process sequencing flexibility in real time [Sheremetov et al.03]. A comparison of these alternative approaches on a common platform is under development. The tested control systems will have varying production volumes (to model production systems with looser or tighter schedules) and disturbance frequencies, so that the impact of the job routing and sequencing decisions can be evaluated in various manufacturing environments. Finally, in ongoing work we are developing another implementation of the proposed model using the JADE agent platform.

ACKNOWLEDGMENTS

Partial support for this research work has been provided by the IMP within the project D.00006. The authors would like to thank Jorge Martínez and Juan Guerra for their contribution to the implementation of the developed algorithms.

APPENDIX

TABLE III

NOTATION

X = {x1, x2, x3, ...}            Set of environment states
A = {a1, a2, a3, ...}            Set of agent actions
x(t)                             State of the environment at time t
a(t)                             Action at time t
A_i = {a_i1, a_i2, ..., a_ik}    Set of k executable actions in state x(t) = i
µ_x(t)(t) = a_ik                 Action policy: maps the states i ∈ X to the actions a_ik ∈ A_i
V^π                              Value function
V*_i                             Optimal value function
Q(i, a_ik)(t)                    Q-value for the state i ∈ X and action a_ik ∈ A_i at time t
0 ≤ α < 1                        Learning rate
γ                                Reduction rate
PO                               Order agent that has the knowledge of final product orders
M = {M1, M2, M3, ..., Mn}        Set of n machine agents
OP_i = {O1, ..., Os}             Set of s operations executed by machine i
~V_i = <v_i1, ..., v_ir>         Vector of non-negative values of r features for each operation O_i (e.g., v_i1 = average time); the features may vary from one machine to another
S = {S1, S2, S3, ..., Sm}        Set of m storage agents denoting raw material providers
MP = {MP1, ..., MPs}             Set of s objects corresponding to a type of raw material
FP = {FP1, ..., FPn}             Set of n final product storages
P = {P1, ..., Pn}                Set of n objects corresponding to a type of final product
~PV_i = <pv_i1, ..., pv_ir>      Vector of non-negative values of r features for each product P_i (e.g., pv_i1 = product priority)


REFERENCES

[Bellifemineand et al.99] Bellifemine, F., Poggi, A., and Rimassa, G. JADE: A FIPA-compliant agent framework. In: Proc. Practical Applications of Intelligent Agents and Multi-Agents, pp. 97-108.

[Bellman57] Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[Bellman58] Bellman, R. On a routing problem. Quarterly of Applied Mathematics, 1958.

[Bertsekas et al.96] Bertsekas, D. P., and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.

[Chandra et al.02] Chandra, C., Smirnov, A., and Sheremetov, L. Multi-Agent Technology for Supply Chain Network Information Support. In: SAE'2002 World Congress Technical Paper Series, USA, 2002.

[Chevalier77] Chevalier, A. La programmation dynamique. Dunod Décision, 1977.

[Dorigo et al.98] Dorigo, M., and Di Caro, G. AntNet: Distributed Stigmergetic Control for Communications Networks. Journal of Artificial Intelligence Research, vol. 9, 1998, pp. 317-365.

[Dreher99] Dreher, D. Logistik-Benchmarking in der Automobil-Branche: ein Führungsinstrument zur Steigerung der Wettbewerbsfähigkeit. Keynote speech at the International Conference on Advances in Production Management Systems - Global Production Management, Berlin, 1999.

[Eloranta et al.99] Eloranta, E., Holmström, J., and Huttunen, K. Keynote speech at the International Conference on Advances in Production Management Systems - Global Production Management, pp. 6-10.

[Ferber97] Ferber, J. Les Systèmes Multi-Agents : Vers Une Intelligence Collective. InterEditions, 1997.

[FIP] FIPA Iterated Contract Net Interaction Protocol Specification, Foundation for Intelligent Physical Agents. URL: http://www.fipa.org.

[Ford et al.62] Ford, L., and Fulkerson, D. Flows in Networks. Princeton University Press, 1962.

[Foukia et al.02] Foukia, N., Fenet, S., Hassas, S., and Hulaas, J. An Intrusion Response Scheme: Tracking the Alert Source using a Stigmergy Paradigm. In: 2nd International Workshop on Security of Mobile Multiagent Systems (SEMAS-2002) at the Sixth International Conference on Autonomous Agents and Multiagent Systems, 2002.

[Hardin68] Hardin, G. The Tragedy of the Commons. Science, 1968.

[JE et al.00] Eschenbächer, J., Knirsch, P., and Timm, I. J. Demand Chain Optimization by Using Agent Technology. In: IFIP WG 5.7 International Conference on Integrated Production Management, pp. 285-292, Norway, 2000.

[Julka et al.02a] Julka, N., Srinivasan, R., and Karimi, I. Agent-based supply chain management-1: framework. Computers and Chemical Engineering, vol. 26, 2002.

[Julka et al.02b] Julka, N., Srinivasan, R., and Karimi, I. Agent-based supply chain management-2: a refinery application. Computers and Chemical Engineering, vol. 26, 2002.

[Lambert et al.00] Lambert, D. M., and Cooper, M. C. Issues in supply chain management. Industrial Marketing Management, vol. 29, 2000.

[Littman et al.93] Littman, M., and Boyan, J. A Distributed Reinforcement Learning Scheme for Network Routing. School of Computer Science, Carnegie Mellon University, 1993.

[Mitchell97] Mitchell, T. M. Machine Learning. McGraw-Hill, 1997.

[Puterman94] Puterman, M. L. Markov Decision Processes. Wiley-Interscience, 1994.

[RM02] Rocha-Mier, L. E. Apprentissage dans une Intelligence Collective Neuronale : application au routage de paquets sur Internet. PhD thesis, Institut National Polytechnique de Grenoble, 2002.

[Sakarovitch84] Sakarovitch, M. Optimisation Combinatoire. Hermann, 1984.

[Schoonderwoerd96] Schoonderwoerd, R. Collective Intelligence for Network Control. Master's thesis, Delft University of Technology, Faculty of Technical Mathematics and Informatics, 1996.

[Shen et al.99] Shen, W., and Norrie, D. H. Agent-Based Systems for Intelligent Manufacturing: A State-of-the-Art Survey. Knowledge and Information Systems, 1999.

[Sheremetov et al.03] Sheremetov, L., Martínez, J., and Guerra, J. Agent Architecture for Dynamic Job Routing in Holonic Environment Based on the Theory of Constraints. In: Holonic and Multi-Agent Systems for Manufacturing (First International Conference on Industrial Applications of Holonic and Multi-Agent Systems), pp. 124-133.

[Sutton et al.98] Sutton, R., and Barto, A. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[Swaminathan et al.98] Swaminathan, J. M., Smith, S. F., and Sadeh, N. M. Modeling supply chain dynamics: A multiagent approach. Decision Sciences, vol. 29(3), 1998.

[Turner93] Turner, R. M. The Tragedy of the Commons and Distributed AI Systems. In: 12th International Workshop on Distributed Artificial Intelligence, 1993.

[Watkins89] Watkins, C. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.

[Weiss99] Weiss, G. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. The MIT Press, 1999.

[Wolpert et al.99] Wolpert, D., and Tumer, K. An Introduction to Collective Intelligence. Technical Report NASA-ARC-IC-99-63, NASA Ames Research Center, 1999.