
A New QoS Provisioning Method for Adaptive Multimedia in Cellular Wireless Networks

Fei Yu, Vincent W.S. Wong and Victor C.M. Leung Department of Electrical and Computer Engineering

The University of British Columbia 2356 Main Mall, Vancouver, BC, Canada V6T 1Z4

E-Mail: {feiy, vincentw, vleung}@ece.ubc.ca

Abstract - Third generation cellular wireless networks are designed to support adaptive multimedia by controlling individual ongoing flows to increase or decrease their bandwidth in response to changes in traffic load. There is growing interest in quality of service (QoS) provisioning under this adaptive multimedia framework, in which a bandwidth adaptation algorithm needs to be used in conjunction with the call admission control algorithm. This paper presents a novel method for QoS provisioning via the use of average reward reinforcement learning, which can maximize the network revenue subject to several predetermined QoS constraints. By considering handoff dropping probability, average allocated bandwidth and intra-class fairness simultaneously, our algorithm formulation guarantees that these QoS parameters are kept within predetermined constraints. Unlike other model-based algorithms, our scheme does not require explicit state transition probabilities, and therefore the assumptions behind the underlying system model are more realistic than those in previous schemes. Moreover, by considering the status of neighboring cells, the proposed scheme can dynamically adapt to changes in traffic condition. Simulation results demonstrate the effectiveness of the proposed approach in adaptive multimedia cellular networks.

Keywords - QoS; adaptive multimedia; cellular wireless networks; mathematical programming/optimization

I. INTRODUCTION

With the growing demand for bandwidth-intensive multimedia applications (e.g., video) in cellular wireless networks, quality of service (QoS) provisioning is becoming more and more important. An efficient call admission control (CAC) scheme is crucial to guarantee the QoS and to maximize the network revenue simultaneously. Most of the CAC strategies proposed in the literature only consider non-adaptive traffic and non-adaptive networks [1], [2]. However, in recent years, the scarcity and large fluctuations of link bandwidth in wireless networks have motivated the development of adaptive multimedia applications, where the bandwidth of a connection can be dynamically adjusted to adapt to the highly variable communication environment. Examples of adaptive multimedia traffic include Moving Picture Experts Group (MPEG)-4 [3] and H.263+ [4] coding for audiovisual contents, which are expected to be used extensively in future cellular wireless networks. Accordingly, advanced cellular networks are designed to provide flexible radio resource allocation capabilities that can efficiently support adaptive multimedia traffic. For example, the third generation (3G) universal mobile telecommunications system (UMTS) can reconfigure the bandwidth of ongoing calls [5].

Under this adaptive multimedia framework, a bandwidth adaptation (BA) algorithm needs to be used in conjunction with the CAC algorithm for QoS provisioning. CAC decides the admission or rejection of new and handoff calls, whereas BA reallocates the bandwidth of ongoing calls.

Recently, QoS provisioning for adaptive multimedia services in cellular wireless networks has been a very active area of research [6]-[12], [14], [15]. A channel sub-rating scheme for telephony services is proposed in [6]. In [7], an analytical model is derived for one class of adaptive service. The extension of these schemes designed for one traffic class to the case of multiple traffic classes in real cellular wireless networks may not be an easy task. Talukdar et al. [8] study the trade-offs between network overload and fairness in bandwidth adaptation for multiple classes of adaptive multimedia. A near-optimal scheme is proposed in [9]. A Markov decision process formulation and linear programming are used in [10]. Degradation ratio and degradation ratio degree are considered in [11]. The authors in [12] use a simulated annealing algorithm to find the optimal call-mix selection. The shortcoming of [6]-[12] is that only the status of the local cell is considered in QoS provisioning. However, due to increasing handoffs between cells that are shrinking in size, the status of neighboring cells has an increased influence on the QoS of the local cell in future multimedia cellular wireless networks [13], and therefore information on neighboring cell traffic is very important for the effectiveness of QoS provisioning methods that can adapt to changes in the traffic pattern [2]. The authors in [14], [15] make fine attempts to consider the status information of neighboring cells. However, only one class of traffic is studied and they do not consider maximizing network revenue.

This paper introduces a novel average reward reinforcement learning (RL) approach to solve the QoS provisioning problem for adaptive multimedia in cellular wireless networks, which aims to maximize the network revenue while satisfying several predetermined QoS constraints. The novelties of the proposed scheme are as follows:

1) The proposed scheme takes into account the effects of the status of neighboring cells with multiple classes of traffic, enabling it to dynamically adapt to changes in the traffic condition.

2) The underlying assumptions of the proposed scheme are more realistic than those in previous schemes. In particular, the scheme does not need prior knowledge of the system state transition probabilities, which are very difficult to estimate in practice due to the irregular network topology, different propagation environments and random user mobility.

3) The algorithm can control the adaptation frequency effectively by accounting for the cost of bandwidth adaptation in the model. It is observed in [7], [8] that frequent bandwidth switching among different levels may consume a lot of resources and may be even worse than a large degradation ratio. The proposed scheme can control the adaptation frequency more effectively than previous schemes.

4) Handoff dropping probability, average allocated bandwidth and intra-class fairness are considered simultaneously as QoS constraints in our scheme and can be guaranteed.

5) A method of trading off action space complexity with state space complexity is proposed in our scheme. A large action space may hinder the deployment of the scheme in real systems; by trading off action space with state space, the large action space problem in QoS provisioning can be solved.

Recently, RL has been used to solve CAC and routing problems in wireline networks [16], [17] and the channel allocation problem in wireless networks [18], [19]. This paper focuses on the application of RL to solve the QoS provisioning problem in adaptive cellular wireless networks.

We compare our scheme with two existing non-adaptive and adaptive QoS provisioning schemes for adaptive multimedia in cellular wireless networks. Extensive simulation results show that the proposed scheme outperforms the others by maximizing the network revenue while satisfying the QoS constraints.

The rest of this paper is organized as follows. Section II describes the QoS provisioning problem in the adaptive framework. Section III describes the average reward RL algorithm. Our new approach to solve the QoS provisioning problem is presented in Section IV. Section V discusses some implementation issues. Section VI presents and discusses the simulation results. Finally, we conclude this study in Section VII.

II. QOS PROVISIONING IN ADAPTIVE FRAMEWORK

A. Adaptive Multimedia Applications

In adaptive multimedia applications, a multimedia connection or stream can dynamically change its bandwidth requirement throughout its lifetime. For example, using the layered coding technique, a raw video sequence can be compressed into three layers [20]: a base layer and two enhancement layers. The base layer can be independently decoded to provide basic video quality, whereas the enhancement layers can only be decoded together with the base layer to further refine the quality of the base layer. Therefore, a video stream compressed into three layers can adapt to three levels of bandwidth usage.

B. Adaptive Cellular Wireless Networks

Due to the severe fluctuation of resources in wireless links, the ability to adapt to the communication environment is very important in future cellular wireless networks. For example, in UMTS systems, a radio bearer established for a call can be dynamically reconfigured during the call session [5]. Fig. 1 shows the signalling procedure between a user terminal (UE) and the serving universal terrestrial radio access network (UTRAN) in radio bearer reconfiguration. The radio bearer in UMTS includes most of the layer 2 and layer 1 protocol information for that call. By reconfiguring the radio bearer, the bandwidth of a call can be changed dynamically.

Fig. 1. Radio bearer reconfiguration in UMTS [5]

C. QoS Provisioning Functions and Constraints

We consider two important functions for QoS provisioning, CAC and BA, in this paper. The problem of QoS provisioning in an adaptive multimedia framework is to determine CAC and BA policies that maximize the long-term network revenue and guarantee the QoS constraints. To reduce network signalling overhead, we assume that BA is invoked only when a call arrival or departure occurs. That is, BA will not be used when congestion occurs briefly due to channel fading. Low-level mechanisms such as error correction coding and efficient packet scheduling are usually used to handle brief throughput variations of wireless links.

Smaller cells (micro/pico-cells) will be employed in future cellular wireless networks to increase capacity. Therefore, the number of handoffs during a call's lifetime is likely to increase, and the status of neighbouring cells has an increased influence on the QoS of the local cell. In order to adapt to changes in the traffic pattern, the status information of neighbouring cells should be considered in QoS provisioning.

We consider three QoS constraints in this paper. Since forced call terminations due to handoff dropping are generally more objectionable than new call blocking, an important call-level QoS constraint in cellular wireless networks is P_hd, the probability of handoff dropping. As it is impractical to eliminate handoff call dropping completely, the best one can do is to keep P_hd below a target level. In addition, although adaptive applications can tolerate decreased bandwidth, it is desirable for some applications to have a bound on the average allocated bandwidth. Therefore, we need another QoS parameter to quantify the average bandwidth received by a call. The normalized average allocated bandwidth of class i calls, denoted as AB^i, is the ratio of the average bandwidth received by class i calls to the bandwidth with un-degraded service. In order to guarantee the QoS of adaptive multimedia, AB^i should be kept above a target value. Finally, due to bandwidth adaptation, some calls may operate at very high bandwidth levels, whereas some calls within the same class may operate at very low bandwidth levels. This is undesirable from the users' perspective. Therefore, the QoS provisioning scheme should be fair to all calls within one class, and intra-class fairness is defined as another QoS constraint in this paper. These constraints will be formulated in Section IV.

We formulate the QoS provisioning problem as a semi-Markov decision process (SMDP) [21]. There are several well-known algorithms, such as policy iteration, value iteration and linear programming [21], that find the optimal solution of an SMDP. However, these traditional model-based solutions to SMDPs require prior knowledge of state transition probabilities and hence suffer from two "curses": the curse of dimensionality and the curse of modeling. The curse of dimensionality is that the complexity of these algorithms increases exponentially as the number of states increases. QoS provisioning involves a very large state space, which makes model-based solutions infeasible. The curse of modeling is that, in order to apply model-based methods, it is first necessary to express the state transition probabilities explicitly. This is in practice a very difficult proposition for cellular wireless networks due to the irregular network topology, different propagation environments and random user mobility.

D. Average Reward Reinforcement Learning

In recent years, RL has become a topic of intensive research as an alternative approach to solve SMDPs. This method has two distinct advantages over model-based methods. The first is that it can handle problems with complex transitions. Secondly, RL can integrate various function approximation methods (e.g., neural networks), which can be used to approximate the value function over a large state space.

Most of the published research in RL is focused on the discounted sum of rewards as the optimality metric. Q-learning [22] is one of the most popular discounted reward RL algorithms. These techniques, however, cannot extend automatically to the average reward criterion. In QoS provisioning problems, performance measures may not suitably be described in economic terms, and hence it may be preferable to compare policies based on the time-averaged expected reward rather than the expected total discounted reward. Discounted RL methods can lead to sub-optimal behavior and may converge much more slowly than average reward RL methods [23]. An algorithm for average reward RL called SMART (Semi-Markov Average Reward Technique) [24]-[26] has emerged recently. The convergence analysis of this algorithm is given in [25], and it has been successfully applied to production inventory [24] and airline seat allocation [26] problems. We use this average reward RL method to solve the QoS provisioning problem for adaptive wireless multimedia in this paper.

III. SOLVING AVERAGE REWARD SMDP BY RL

In this section, we describe the average reward SMDP. The optimality equation is introduced. We then describe the reinforcement learning approach.

A. Average Reward Semi-Markov Decision Process

For an SMDP, let S be a finite set of states and A be a set of possible actions. In state s in S, when an action a in A is chosen, a lump sum reward of k(s, a) is received. Further accrual of reward occurs at a rate c(s', s, a), s' in S, for the time the process remains in state s' between the decision epochs. The expected reward between two decision epochs, given that the system is in state s and a is chosen at the first decision epoch, may be expressed as

r(s,a) = k(s,a) + E\left[\int_0^{\tau} c(W_t, s, a)\,dt\right], \qquad (1)

where \tau is the transition time to the second decision epoch, W_t denotes the state of the natural process, and E denotes the expectation.

Starting from state s at time 0 and using a policy \pi, the average reward g^{\pi}(s) can be given as

g^{\pi}(s) = \lim_{N\to\infty} \frac{E_s^{\pi}\left[\sum_{n=0}^{N-1} r(s_n, a_n)\right]}{E_s^{\pi}\left[\sum_{n=0}^{N-1} \tau_n\right]}, \qquad (2)

where \sigma_n represents the time of the (n+1)th decision epoch, \tau_n = \sigma_{n+1} - \sigma_n, and E_s^{\pi} denotes the expectation with respect to policy \pi and initial state s.

The Bellman optimality equation for SMDP [21] can be stated as follows.

THEOREM 1. For any finite unichain SMDP, there exists a scalar g* and a value function R* satisfying the system of equations

R^*(s) = \max_{a \in A}\left\{ r(s,a) - g^* q(s,a) + \sum_{s' \in S} P_{ss'}(a) R^*(s') \right\}, \qquad (3)

where q(s, a) is the average sojourn time in state s when action a is taken in it, and P_{ss'}(a) is the probability of transition from state s to state s' under action a.

For a proof of Theorem 1, see Chapter 11 of [21].


B. Solution Using Reinforcement Learning

In the RL model depicted in Fig. 2, a learning agent selects an action for the system that leads the system along a unique path until another decision-making state is encountered. At this time, the system needs to consult the learning agent for the next decision. During a state transition, the agent gathers information about the new state, the immediate reward and the time spent during the state transition, based on which the agent updates its knowledge base using a learning algorithm and selects the next action. The process is repeated and the learning agent continues to improve its performance.

Average reward RL uses an action value representation similar to its counterpart in Q-learning. The action value R^{\pi}(s, a) represents the average adjusted value of choosing an action a in state s once, and then following policy \pi subsequently [23]. Let R*(s, a) be the average adjusted value when choosing actions optimally. The Bellman equation for average reward SMDPs (3) can be rewritten as

R^*(s,a) = r(s,a) - g^* q(s,a) + \sum_{s' \in S} P_{ss'}(a) \max_{a'} R^*(s',a'). \qquad (4)

The optimal policy is \pi^*(s) = \arg\max_a R^*(s,a). The average reward RL algorithm estimates action values on-line using a temporal difference method, and then uses them to define a policy.

The action value of the state-action pair (s, a) visited at the nth decision-making epoch is updated as follows. Assume that action a in state s results in a system transition to s' at the subsequent decision epoch; then

R_{n+1}(s,a) = (1 - \alpha_n) R_n(s,a) + \alpha_n \left[ r_{act}(s',s,a) - \rho_n \tau_n + \max_{a'} R_n(s',a') \right], \qquad (5)

where \alpha_n is the learning rate parameter for updating the action value of a state-action pair at the nth decision epoch, and r_{act}(s', s, a) is the actual cumulative reward earned between two successive decision epochs starting in state s (with action a) and ending in state s'. The reward rate \rho_n is estimated on-line from the cumulative reward earned and the total elapsed time, where T(n) denotes the sum of the time spent in all states visited up to the nth epoch and \beta_n is the learning rate parameter used in the reward rate update. If each action is executed in each state an infinite number of times on an infinite run, and \alpha_n and \beta_n are decayed appropriately, the above learning algorithm will converge to optimality [25].
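To make the update rule concrete, the sketch below shows a tabular Python version of this average reward update. It is an illustration only: the paper stores action values in a neural network (Section V), and the reward rate estimate here is simply cumulative reward over cumulative time, which follows the SMART convention of [24]-[26] rather than the exact update used in the paper. The decay constant 1e6 and the helper names are assumptions.

import random
from collections import defaultdict

class AverageRewardAgent:
    """Tabular sketch of an average reward (SMART-style) update as in Eq. (5)."""

    def __init__(self, alpha0=0.1, p0=0.1):
        self.R = defaultdict(float)   # action values R(s, a)
        self.rho = 0.0                # reward rate estimate
        self.CR = 0.0                 # cumulative reward
        self.T = 0.0                  # cumulative time
        self.n = 0                    # decision epoch counter
        self.alpha0, self.p0 = alpha0, p0

    def _decayed(self, theta0):
        # search-then-converge style decay (see Section IV.E); 1e6 is an assumed constant
        c = self.n ** 2 / (1e6 + self.n)
        return theta0 / (1.0 + c)

    def choose(self, s, actions):
        """Epsilon-greedy choice over the feasible action set."""
        if random.random() < self._decayed(self.p0):
            return random.choice(actions)
        return max(actions, key=lambda a: self.R[(s, a)])

    def update(self, s, a, s_next, feasible_next, r_act, tau):
        """Apply Eq. (5); r_act is the reward accrued over the sojourn time tau."""
        alpha = self._decayed(self.alpha0)
        best_next = max(self.R[(s_next, a2)] for a2 in feasible_next)
        self.R[(s, a)] = (1 - alpha) * self.R[(s, a)] + alpha * (
            r_act - self.rho * tau + best_next)
        # reward rate kept as cumulative reward / cumulative time (assumed form)
        self.CR += r_act
        self.T += tau
        self.rho = self.CR / self.T
        self.n += 1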

Fig. 2. A reinforcement learning model

IV. FORMULATION OF QOS PROVISIONING IN ADAPTIVE FRAMEWORK

In adaptive multimedia cellular wireless networks, we assume that call arrivals at a given cell, including new and handoff calls, follow a Poisson process. We further assume that each call needs a service time that is exponentially distributed and independent of the inter-arrival time distribution. The QoS provisioning problem for adaptive multimedia can be formulated as an SMDP. In order to utilize the average reward RL algorithm, it is necessary to identify the system states, actions, rewards and constraints. The exploration scheme and the method to trade off action space with state space are also described in this section.

A. System States

Assume that there are K classes of services in the network. A class i call uses a bandwidth among {b_{i1}, b_{i2}, ..., b_{ij}, ..., b_{iN_i}}, where b_{ij} < b_{i,j+1}, i = 1, 2, ..., K, j = 1, 2, ..., N_i, and N_i is the maximum bandwidth level that can be used by a class i call. At random times an event e can occur in a cell c (we assume that only one event can occur at any time instant), where e is either a new call arrival, a handoff call arrival, a call termination, or a call handoff to a neighboring cell. At this time, cell c is in a particular configuration x defined by the number of each type of ongoing call in cell c. Let x = (x_{11}, x_{12}, ..., x_{ij}, ..., x_{KN_K}), where x_{ij} denotes the number of ongoing calls of class i using bandwidth b_{ij} in cell c, for 1 <= i <= K and 1 <= j <= N_i. Since the status of neighboring cells is important for QoS provisioning, we also consider it in the state description. The status of neighboring cells y can be defined as the number of each type of ongoing call in all neighboring cells of cell c. Let y = (y_{11}, y_{12}, ..., y_{ij}, ..., y_{KN_K}), where y_{ij} denotes the number of ongoing calls of class i using bandwidth b_{ij} in all neighboring cells of cell c. We assume that the status of neighboring cells is available in cell c through the exchange of status information between cells. Note that this assumption is common among dynamic QoS provisioning schemes [2]. The configurations and the event together determine the state, s = (x, y, e).

We assume that each cell has a fixed channel capacity C and cell c has M neighboring cells. The state space is defined as


S = \left\{ (x, y, e) : \sum_{i=1}^{K}\sum_{j=1}^{N_i} x_{ij} b_{ij} \le C,\ \sum_{i=1}^{K}\sum_{j=1}^{N_i} y_{ij} b_{ij} \le M C,\ 1 \le i \le K,\ 1 \le j \le N_i \right\}.
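A minimal sketch of how such a state could be represented in code is shown below. The dictionary encoding, class names, and validity check are illustrative assumptions; the bandwidth levels, the 2 Mbps capacity and the six neighbours are taken from the simulation setup in Section VI.

from dataclasses import dataclass
from typing import Dict, Tuple

# bandwidth levels b_ij (kbps) per class, as in Table I
BANDWIDTH = {1: [128, 192, 256], 2: [64, 96, 128]}

@dataclass
class State:
    x: Dict[Tuple[int, int], int]   # local cell: (class i, level j) -> number of calls
    y: Dict[Tuple[int, int], int]   # neighbouring cells, aggregated
    event: str                      # 'new', 'handoff_in', 'termination', 'handoff_out'

def used_bandwidth(config: Dict[Tuple[int, int], int]) -> int:
    """Total bandwidth occupied by a configuration."""
    return sum(n * BANDWIDTH[i][j - 1] for (i, j), n in config.items())

def is_valid(s: State, C: int = 2000, M: int = 6) -> bool:
    """Capacity bounds assumed for the state space: local cell within C,
    neighbouring cells (aggregated over M cells) within M * C."""
    return used_bandwidth(s.x) <= C and used_bandwidth(s.y) <= M * C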

B. Actions

When an event occurs, the agent must choose an action according to the state. An action can be denoted as a = (a_a, a_d, a_u), where a_a stands for the admission decision, i.e., admit (a_a = 1), reject (a_a = 0), or no action due to a call departure (a_a = -1); a_d stands for the action of bandwidth degradation when a call is accepted; and a_u stands for the action of bandwidth upgrade when there is a departure (call termination or handoff to a neighboring cell) from cell c. a_d has the form

a_d = \left\{ \left( d_{i2}^{1}, \ldots, d_{ij}^{n}, \ldots, d_{iN_i}^{N_i-1} \right) : 1 \le i \le K,\ 1 < j \le N_i,\ 1 \le n < j \right\},

where d_{ij}^{n} denotes the number of ongoing class i calls using bandwidth b_{ij} that are degraded to bandwidth b_{in}. a_u has the form

a_u = \left\{ \left( u_{i1}^{2}, \ldots, u_{ij}^{n}, \ldots, u_{i,N_i-1}^{N_i} \right) : 1 \le i \le K,\ 1 \le j < N_i,\ j < n \le N_i \right\},

where u_{ij}^{n} denotes the number of ongoing class i calls using bandwidth b_{ij} that are upgraded to bandwidth b_{in}.

After the action of bandwidth degradation, the configuration (x_{11}, x_{12}, ..., x_{ij}, ..., x_{KN_K}) becomes the configuration with components

x_{ij}' = x_{ij} - \sum_{n=1}^{j-1} d_{ij}^{n} + \sum_{m=j+1}^{N_i} d_{im}^{j}.

Similarly, after the action of bandwidth upgrade, the configuration (x_{11}, x_{12}, ..., x_{ij}, ..., x_{KN_K}) becomes the configuration with components

x_{ij}' = x_{ij} - \sum_{n=j+1}^{N_i} u_{ij}^{n} + \sum_{m=1}^{j-1} u_{im}^{j}.
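The effect of a degradation action on the configuration can be expressed directly in code. The sketch below is a small illustration under the dictionary encoding of the previous sketch, with the degradation action given as counts d[(i, j, n)] of class-i calls moved from level j down to level n; the names are hypothetical.

from typing import Dict, Tuple

Config = Dict[Tuple[int, int], int]          # (class i, level j) -> call count
Degrade = Dict[Tuple[int, int, int], int]    # (i, from level j, to level n) -> count

def apply_degradation(x: Config, d: Degrade) -> Config:
    """Return the configuration after degrading d[(i, j, n)] class-i calls
    from bandwidth level j to level n (n < j)."""
    new_x = dict(x)
    for (i, j, n), count in d.items():
        assert n < j and new_x.get((i, j), 0) >= count, "infeasible degradation"
        new_x[(i, j)] = new_x.get((i, j), 0) - count
        new_x[(i, n)] = new_x.get((i, n), 0) + count
    return new_x

# Example: degrade one class-1 call from level 3 (256 kbps) to level 2 (192 kbps)
x = {(1, 3): 4, (1, 2): 1, (2, 3): 6}
x_after = apply_degradation(x, {(1, 3, 2): 1})
print(x_after)   # {(1, 3): 3, (1, 2): 2, (2, 3): 6}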

C. Rewards

Based on the action taken in a state, the network earns deterministic revenue due to the carried traffic in the cell. On the other hand, extra signaling overhead is required for bandwidth adaptation, which consumes radio and wireline bandwidth as well as battery power in the mobile. It is observed in [7], [8] that frequent bandwidth switching among different levels may consume a lot of resources and may be even worse than a large degradation ratio. Thus, there is a trade-off between the network resources utilized by the calls and the signaling and processing load incurred by bandwidth adaptation operations. We use a function to model the cost due to the action of bandwidth adaptation. The definition of the cost function depends on the specific traffic, user terminal, and network architecture in real networks. One intuitive definition is that the cost is proportional to the number of bandwidth adaptation operations, which is used in this paper.

Let r_{ij} be the reward rate of a class i call using bandwidth b_{ij}, c_a be the cost of one bandwidth adaptation operation, and N_a(a) be the total number of bandwidth adaptation operations in action a. The actual cumulative reward r_{act}(s', s, a) between two successive decision epochs starting in state s (with action a) and ending in state s' can be calculated as

r_{act}(s',s,a) = \sum_{i=1}^{K}\sum_{j=1}^{N_i} r_{ij}\, x_{ij}\, t_{act}(s',s,a) - c_a N_a(a),

where t_{act}(s', s, a) is the actual sojourn time between the decision epochs.

By formulating the cost of the bandwidth adaptation operation in the model, we can control the adaptation operation frequency effectively. Note that all ongoing calls in the cell, including those that have been degraded or upgraded, contribute to the reward r_{act}(s', s, a). Therefore, we do not need an extra term to formulate the penalty related to the bandwidth degradation.
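A minimal sketch of this reward computation, assuming the revenue accrues at rate sum_ij r_ij x_ij over the sojourn time and a fixed cost c_a per adaptation operation; the reward-rate table below is reward function 1 of Table I and the function name is illustrative.

from typing import Dict, Tuple

# per-call reward rates r_ij, e.g. reward function 1 of Table I (r_ij = b_ij)
REWARD_RATE = {(1, 1): 128, (1, 2): 192, (1, 3): 256,
               (2, 1): 64,  (2, 2): 96,  (2, 3): 128}

def actual_reward(x_next: Dict[Tuple[int, int], int],
                  num_adaptations: int,
                  sojourn_time: float,
                  c_a: float = 30.0) -> float:
    """Cumulative reward between two decision epochs: revenue of all ongoing
    calls over the sojourn time minus the cost of adaptation operations."""
    revenue_rate = sum(n * REWARD_RATE[(i, j)] for (i, j), n in x_next.items())
    return revenue_rate * sojourn_time - c_a * num_adaptations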

D. Constraints

For a general SMDP with L constraints, the optimal policy for at most L of the states is randomized [27]. Since L is much smaller than the total number of states in the QoS provisioning problem considered in this paper, the non-randomized stationary policy learned by RL is often a good approximation to the optimal policy [28]. To avoid the complications of randomization, we concentrate on non-randomized policies in this study.

As mentioned in Section II, the first QoS constraint is related to the handoff dropping probability. Let P_hd(s) be the measured handoff dropping ratio and TP_hd denote the target maximum allowed handoff dropping probability. The constraint associated with P_hd can be formulated as

P_{hd} \le TP_{hd}.

The Lagrange multiplier formulation relating the constrained optimization to an unconstrained optimization [29], [30] is used in this paper to deal with the handoff dropping constraint. To fit into this formulation, we need to include history information in our state descriptor. The new state descriptor is s~ = (N_hr, N_hd, I, s), where N_hr and N_hd are the total numbers of handoff call requests and handoff call drops, respectively, I is the time interval between the last and the current decision epochs, and s is the original state


descriptor. In order to make the state space finite, quantized values of P_hd = N_hd / N_hr and I are used.

A Lagrange multiplier \omega is used for the parameterized reward

\tilde{r}_{act}(\tilde{s}', \tilde{s}, a) = r_{act}(\tilde{s}', \tilde{s}, a) - \omega\, z(\tilde{s}', \tilde{s}, a),

where r_{act}(\tilde{s}', \tilde{s}, a) is the original reward function and z(\tilde{s}', \tilde{s}, a) = P_{hd}(\tilde{s})\, r_{act}(\tilde{s}', \tilde{s}, a) is the cost function associated with the constraint. A nice monotonicity property associated with \omega shown in [29] facilitates the search for a suitable \omega.

The second QoS constraint is related to AB^i, the normalized average allocated bandwidth of class i calls. Let B^i denote the bandwidth allocated to a class i call; AB^i can be defined as the mean of B^i / b_{iN_i} over all class i calls in the current cell. Recall that b_{iN_i} is the bandwidth of a class i call with un-degraded service.

AB^i should be kept larger than the target value TAB^i:

AB^i \ge TAB^i, \quad i = 1, \ldots, K.

The third QoS constraint is the intra-class fairness constraint, which can be defined in many ways. In this paper, we use the variance of B^i / b_{iN_i} over all class i calls in the current cell, VB^i, to characterize the intra-class fairness:

VB^i = \mathrm{var}\{ B^i / b_{iN_i} \}.

VB^i reflects the difference between the bandwidth of individual class i calls and the average bandwidth. For absolute fairness, VB^i should be kept at zero all the time. However, this is very difficult to achieve in practice as bandwidth is adjusted in discrete steps. Therefore, it is better to keep VB^i below a target value TVB^i:

VB^i \le TVB^i, \quad i = 1, \ldots, K.

AB^i and VB^i are intrinsic properties of a state. With the current state and action information (s~, a), we can forecast AB^i and VB^i in the next state s~', i.e., AB^i(s~') and VB^i(s~'). If AB^i(s~') >= TAB^i and VB^i(s~') <= TVB^i, i = 1, ..., K, the action is feasible. Otherwise, this action should be eliminated from the feasible action set A(s~).

E. Exploration

Each action should be executed in each state an infinite number of times to guarantee the convergence of an RL algorithm. This is called exploration [31]. Exploration plays an important role in ensuring that all the states of the underlying Markov chain are visited by the system and that all the potentially beneficial actions in each state are tried out. Therefore, with a small probability p_n, at the nth decision-making epoch a decision other than that with the highest action value should be taken.

In this paper, we use the Darken-Chang-Moody search-then-converge procedure [32] to decay the learning rates \alpha_n, \beta_n and the exploration rate p_n. In the following expression, \Theta can be substituted by \alpha, \beta and p for learning and exploration, respectively. We use the equation \Theta_n = \Theta_0 / (1 + c_n), where c_n = n^2 / (\Theta_r + n), and \Theta_0 and \Theta_r are constants.
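The decay schedule can be written directly from this expression. In the sketch below, theta0 and theta_r are placeholders for the constants Theta_0 and Theta_r; the specific values used in the simulations are given in Section VI.

def dcm_decay(theta0: float, theta_r: float, n: int) -> float:
    """Darken-Chang-Moody search-then-converge schedule:
    Theta_n = Theta_0 / (1 + c_n), with c_n = n^2 / (Theta_r + n)."""
    c_n = n ** 2 / (theta_r + n)
    return theta0 / (1.0 + c_n)

# Early iterations stay close to theta0; later iterations decay roughly as 1/n.
rates = [dcm_decay(0.1, 1e6, n) for n in (0, 10, 10_000, 1_000_000)]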

F. Trading off Action Space Complexity with State Space Complexity

We can see that the action space in our formulation is quite large. In this paper, we propose to trade off action space complexity with state space complexity in the QoS provisioning scheme, using a method described in [31]. The advantages of doing this are that the action space will be reduced, and the extra state space complexity may still be dealt with by using the function approximation described in Section V.

Suppose that a call arrival event occurs in a cell with state s~; the action that can be chosen is of the form

a = \left( a_a, d_{i2}^{1}, \ldots, d_{ij}^{n}, \ldots, d_{iN_i}^{N_i-1} \right),

where there are at most V = 1 + \sum_{i=1}^{K}\sum_{j=2}^{N_i} (j-1) components.

We can break down the action a into a sequence of V controls a_a, d_{i2}^{1}, ..., d_{ij}^{n}, ..., d_{iN_i}^{N_i-1}, and introduce artificial intermediate "states" (s~, a_a), (s~, a_a, d_{i2}^{1}), ..., together with the corresponding transitions to model the effect of these controls. In this way, the action space is simplified at the expense of introducing V-1 additional layers of states and V-1 additional action values R(s~, a_a), R(s~, a_a, d_{i2}^{1}), ..., in addition to R(s~, a_a, d_{i2}^{1}, ..., d_{iN_i}^{N_i-1}). In effect, we view the problem as a deterministic dynamic programming problem with V stages. For v = 1, ..., V, we have a v-solution (a partial solution involving just v components) for the vth stage of the problem. The terminal state corresponds to the V-solution (a complete solution with V components). Moreover, instead of selecting the controls in a fixed order, it is possible to leave the order subject to choice.


In the reformulated problem, at any given state s~ = (N_hr, N_hd, I, x, y, e) where e is a call arrival of class l, the control choices are:

1) Reject the call, in which case the configuration x does not evolve.

2) Admit the call and no bandwidth adaptation is needed, in which case the configuration x evolves to (x_{11}, x_{12}, ..., x_{ij}, ..., x_{lN_l} + 1, ..., x_{KN_K}).

3) Admit the call and bandwidth adaptation is needed. In this case, the problem can be divided into V stages. At the vth stage (v = 1, ..., V), one particular call type that has not been selected in previous stages, say the one using bandwidth b_{ij} with x_{ij} > 0, can be selected, with the following options:

a) Degrade one call using bandwidth b_{ij} by one level, in which case the configuration x evolves to (x_{11}, x_{12}, ..., x_{i,j-1} + 1, x_{ij} - 1, ..., x_{lN_l} + 1, ..., x_{KN_K}).

b) Degrade two calls using bandwidth b_{ij} by one level, in which case the configuration x evolves to (x_{11}, x_{12}, ..., x_{i,j-1} + 2, x_{ij} - 2, ..., x_{lN_l} + 1, ..., x_{KN_K}).

c) Continue increasing the number of calls being degraded until the call arrival can be accommodated. The number of options depends on the specific selected call type and the class of the call arrival.

A similar trade-off can be applied when a call departure event occurs.

V. ALGORITHM IMPLEMENTATION

A. Approximate Representation of Action Values

In practice, an important issue is how to store the action value R(s, a). An approximate representation should be used to break the curse of dimensionality in the face of very large state spaces. A neural network is an efficient method to represent the action values. A popular neural network architecture is the multi-layer perceptron (MLP) with a single hidden layer [31]. Under this architecture, the state-action pair (s, a) is encoded as a vector and transformed linearly through the input layer, using the coefficients in this layer, to give several scalars. Each of these scalars then becomes the input to a sigmoidal function in the hidden layer. Finally, the outputs of the sigmoidal functions are linearly combined using coefficients, known as the weights of the network, to produce the final output.
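A minimal NumPy sketch of such a single-hidden-layer approximator is given below. The 31-input / 20-hidden / 1-output sizing follows the simulation setup in Section VI, while the encoding of (s, a) into the input vector, the learning rate, and the training details are assumptions, not the paper's implementation.

import numpy as np

class MLPActionValue:
    """Single-hidden-layer perceptron approximating R(s, a)."""

    def __init__(self, n_in=31, n_hidden=20, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, v):
        """v is the encoded state-action vector (length n_in)."""
        self._h = 1.0 / (1.0 + np.exp(-(self.W1 @ v + self.b1)))   # sigmoid hidden layer
        return float(self.W2 @ self._h + self.b2)

    def train(self, v, target):
        """One back-propagation step towards the temporal-difference target."""
        out = self.predict(v)
        err = out - target
        grad_W2 = err * self._h
        grad_hidden = err * self.W2 * self._h * (1 - self._h)
        self.W2 -= self.lr * grad_W2
        self.b2 -= self.lr * err
        self.W1 -= self.lr * np.outer(grad_hidden, v)
        self.b1 -= self.lr * grad_hidden
        return 0.5 * err ** 2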

Fig. 3. The structure of the QoS provisioning scheme

The network is trained in a supervised fashion using the back-propagation algorithm. This means that during training both network inputs and target outputs are used. An input pattern is applied to the network to generate an output, which is compared to the corresponding target output to produce an error that is propagated back through the network. The network weights are adjusted to minimize the sum of the squared errors.

B. Structure and Pseudo-code

The structure of the RL-based QoS provisioning scheme is shown in Fig. 3. When an event (either a call arrival or departure) occurs, a state s is identified by obtaining the status of the local and neighbouring cells. Then, a set of feasible actions {a} is found according to the state. The state and action information is fed into the neural network to get the action values. With probability 1 - p_n, the action with the largest action value is chosen. Otherwise, exploration is performed and an action is chosen randomly. When the next event occurs, the action value is updated and the process is repeated. A pseudocode description of the proposed scheme is given in Fig. 4.

Fig. 4. Pseudocode of the proposed scheme:

    initialize iteration count n := 0, action value R_0(s, a) := 0,
        cumulative reward CR := 0, total time T := 0, reward rate rho_0 := 0
    while n < MAX_STEPS do
        calculate p_n, alpha_n, beta_n using iteration count n
        with probability (1 - p_n), trade off action space with state space and
            choose an action a_n in A that maximizes R_n(s_n, a_n);
            otherwise, choose a random (exploratory) action from A
        execute the chosen action; wait for the next event e
        update R_{n+1}(s_n, a_n) := (1 - alpha_n) R_n(s_n, a_n) +
            alpha_n [ r_act(s_{n+1}, s_n, a_n) - rho_n * tau_n + max_{a'} R_n(s_{n+1}, a') ]
        update CR := CR + r_act(s_{n+1}, s_n, a_n), T := T + tau_n, and the reward rate rho_{n+1}
        n := n + 1
    end while

VI. SIMULATION RESULTS AND DISCUSSIONS

A cellular network of 19 cells is used in our simulations, as shown in Fig. 5. To avoid the edge effect of the finite network size, wrap-around is applied to the edge cells so that each cell has six neighbours. Each cell has a fixed bandwidth of 2 Mbps. Two classes of flows are considered (see Table I). Class 1 traffic has three different bandwidth levels, 128, 192 and 256 kbps. The three possible bandwidth levels of class 2 traffic are 64, 96 and 128 kbps. Two reward functions are used in the simulations, as shown in Table I. Reward function 1 represents the scenario in which the reward generated by a call grows linearly with the bandwidth assigned to the call; specifically, r_{ij} = b_{ij}. In reward function 2, a convex function r_{ij} = (b_{max}^2 - (b_{max} - b_{ij})^2) / b_{max} is used, where b_{max} is the maximum bandwidth used by a call in the network. We assume that the highest possible bandwidth level is requested by a call arrival. That is, a call arrival of class 1 always requests 256 kbps and a call arrival of class 2 always requests 128 kbps. The network then makes the CAC decision and decides which bandwidth level the call can use if it is admitted. 30% of the offered traffic is from class 1. Moreover, call holding time and cell residence time are assumed to follow exponential distributions with mean values of 180 seconds and 150 seconds, respectively. The probability of a user handing off to any adjacent cell is equally likely. The target maximum allowed handoff dropping probability, TP_hd, is 1% for both classes. Other QoS constraints are changed in the simulations for evaluation purposes.

The action values are learnt by running the simulation for 30 million steps with a constant new call arrival rate of 0.1 calls/second. The constants used in the Darken-Chang-Moody decaying scheme for the learning and exploration rates are chosen as \alpha_0 = \beta_0 = p_0 = 0.1, with a common large constant for \alpha_r, \beta_r and p_r. The monotonicity property associated with \omega is used to search for a suitable \omega, which is 157560 in the simulations. A multi-layer neural network is used in the approximate representation of action values, in which there are 31 input units representing the state and action, 20 hidden units with sigmoid functions, and one output unit representing the action value. The neural network is trained on-line using the back-propagation algorithm in conjunction with the reinforcement learning. The trained network is then used to make CAC and BA decisions with different call arrival rates.

Two QoS provisioning schemes are used for comparison: the guard channel (GC) scheme [1] for non-adaptive traffic and the ZCD02 scheme [12] for adaptive multimedia. 256 kbps is reserved for handoff calls in the GC scheme. In ZCD02, an optimal call-mix selection scheme is derived using simulated annealing. The proposed scheme is called RL in the following. The linear reward function is used in all simulation experiments except those in Subsection VI.C, where the convex reward function is used.

A. Uniform Traffic

We first use a uniform traffic distribution in the simulations, where the traffic load is the same among all 19 cells. Call arrivals of both classes to each cell follow a Poisson process.

The average rewards of the different schemes normalized by that of the GC scheme are shown in Fig. 6.

Fig. 5. Cellular network configuration used in simulations

Average allocated bandwidth and intra-class fairness constraints are not considered here. We can see that RL and ZCD02 yield more reward than GC. In the GC scheme, a call is rejected if the free bandwidth available is not sufficient to satisfy the request. Both RL and ZCD02 have a bandwidth adaptation function and can therefore yield more reward than GC. In Fig. 6, the reward of the proposed scheme is similar to that of ZCD02, because both of them maximize network revenue in QoS provisioning. We can also observe that at low traffic load, as the new call arrival rate increases, the gain becomes more significant. This is because the heavier the offered load, the more bandwidth adaptation is needed when the cell is not saturated. However, when the traffic is high and the cell is becoming saturated, the performance gain of RL and ZCD02 over GC is less significant. The cost of the adaptation operation is not considered, i.e., c_a = 0, in Fig. 6.

Fig. 7 shows the effects of c_a, the cost of an adaptation operation, when the new call arrival rate is 0.067 calls/second. The reward of ZCD02 drops quickly as c_a increases, and is even less than that of GC when c_a = 150. In contrast, the reward drops slowly in RL. Since RL formulates c_a in the reward function, it eliminates those actions requiring a large number of adaptation operations when c_a is high, by comparing the action values of different actions. Therefore, the proposed scheme can control the adaptation cost, and therefore the adaptation frequency, effectively. We use c_a = 30 in the following simulation experiments.

Fig. 8 shows that RL maintains an almost constant handoff dropping probability for a large range of new call arrival rates. In contrast, neither ZCD02 nor GC can enforce the QoS guarantee for the handoff dropping probability. We can reduce the handoff dropping probability in the GC scheme by increasing the number of guard channels, and in ZCD02 by increasing the "virtual gain function" of handoff calls. However, this will further reduce the reward earned in these two schemes.

Table I. Experimental Parameters

Traffic   Bandwidth Level (kbps)   Reward Function 1   Reward Function 2
Class 1   b11: 128                 128                 192
          b12: 192                 192                 240
          b13: 256                 256                 256
Class 2   b21: 64                  64                  112
          b22: 96                  96                  156
          b23: 128                 128                 192
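The two reward functions can be reproduced directly from the bandwidth levels. The short check below evaluates both, assuming b_max = 256 kbps as stated in Section VI; it yields the Reward Function 1 and 2 columns of Table I.

B_MAX = 256  # maximum bandwidth used by any call in the network (kbps)

def reward_linear(b):
    """Reward function 1: r_ij = b_ij."""
    return b

def reward_convex(b):
    """Reward function 2: r_ij = (b_max^2 - (b_max - b_ij)^2) / b_max."""
    return (B_MAX ** 2 - (B_MAX - b) ** 2) / B_MAX

for b in (128, 192, 256, 64, 96):
    print(b, reward_linear(b), reward_convex(b))
# 128 -> 192, 192 -> 240, 256 -> 256, 64 -> 112, 96 -> 156 (matches Table I)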


Fig. 9 and Fig. 10 show the new call blocking probabilities of class 1 and class 2 traffic, respectively. Both ZCD02 and RL have lower blocking probabilities compared with GC, because both of them have adaptation capability and can accept more new calls.

Fig. 11 shows the normalized allocated average bandwidths. TAB^i = 0.7 is considered here. We can observe that as the new call arrival rate increases, the average bandwidths of both classes in ZCD02 and RL decrease. This is the result of the bandwidth adaptation. For some applications, it may be desirable to have a bounded average allocated bandwidth. From Fig. 11, it is shown that the normalized allocated average bandwidth can be bounded by the target value in RL. In contrast, ZCD02 cannot guarantee this average bandwidth QoS constraint. The average bandwidth of GC is always 1, because no adaptation operation is performed in GC. Note that the lowest possible normalized average bandwidth is 0.5 for both classes. This can be seen from Table I, where the lowest bandwidth level is half of the highest bandwidth level for both classes. The normalized bandwidth variance, VB^i, an indicator of intra-class fairness, is shown in Fig. 12. We can see that RL can keep the bandwidth variance below the target value. Since the bandwidth in GC cannot be changed, the bandwidth variance is always 0 in GC. The achievement of higher QoS requirements comes at a cost to the system. The effects of different values of TAB^i and TVB^i on the average reward are shown in Fig. 13 and Fig. 14, respectively. We can see that a higher TAB^i, which is preferred from the users' point of view, will reduce the reward. Similarly, a lower TVB^i, which means higher intra-class fairness, will reduce the reward as well.

B. Non-Uniform Traffic

In the non-uniform traffic situation, the cells in the second ring, i.e., cells 2, 3, ..., 7 in Fig. 5, have 1.5 times the new call arrival rate of the cells in the outer ring, i.e., cells 8, 9, ..., 19. The central cell has 2 times the new call arrival rate of the cells in the outer ring. Since the method of predicting the handoff rate from neighboring cells is not given in ZCD02, a static predicted handoff rate is used in the revenue function and we call it ZCD02-static. Fig. 15 shows that RL yields more reward than the ZCD02-static and GC schemes. The performance gain of RL over GC and the difference between RL and ZCD02-static are significant in the non-uniform traffic situation. This is because our RL method takes into account the status of neighbouring cells, and therefore it can dynamically adapt to different traffic patterns.

C. A Different Reward Function

A convex reward function r_{ij} = (b_{max}^2 - (b_{max} - b_{ij})^2) / b_{max} is used in this situation. The reward rate for each bandwidth level of each class is shown in Table I. The simulation results using the convex reward function show a similar pattern to those using the linear reward function, and therefore only one figure is provided here. Fig. 16 shows the average rewards of the different QoS schemes with non-uniform traffic. We can see that Fig. 16 is similar to Fig. 15.


D. Computation Complexity

ZCD02 uses simulated annealing to find the optimal call-mix, in which a variable called temperature is decreased periodically by employing a monotone descending cooling function. We follow the example given in ZCD02, where 90 temperature steps are used and each step is repeated 100 times. In each of the 9000 steps, the revenue and the constraints are re-evaluated. In RL, since a neural network is used in the approximate representation, the major operations required to make the CAC and BA decisions come from retrieving and comparing action values. We run the simulations with a fixed call arrival rate of 0.1 calls/second for 1000 call arrivals and departures, and calculate the average number of operations (additions, multiplications and comparisons) required to make one decision. The number of operations required in ZCD02 is considerably larger than in RL. This shows that ZCD02 will be more expensive than RL in terms of computation resources in practice. However, training is needed for the RL approach, whereas ZCD02 and GC do not need any training.

VII. CONCLUSIONS

In this paper, we have proposed a new QoS provisioning method for adaptive multimedia in cellular wireless networks. The large number of states and the difficulty of estimating the state transition probabilities in practical systems motivate us to choose a model-free average reward reinforcement learning solution to this problem. By considering the status of neighboring cells, the proposed scheme can dynamically adapt to changes in the traffic condition. Three QoS constraints, handoff dropping probability, average allocated bandwidth and intra-class fairness, have been considered. Simulation results have been presented to show the effectiveness of the proposed scheme in adaptive multimedia cellular networks.

Further study is in progress to reduce or eliminate the signaling overhead of exchanging status information by means of feature extraction and local estimation functions. It is also very interesting to consider other average reward reinforcement learning algorithms [17], [33].

ACKNOWLEDGMENT

This work was supported by the Canadian Natural Sciences and Engineering Research Council through grants RGPIN 26160443 and OGPOOU286.

REFERENCES

[1] D. Hong and S. S. Rappaport, "Traffic model and performance analysis for cellular mobile radio telephone systems with prioritized and non-prioritized handoff procedures," IEEE Trans. Veh. Technol., vol. VT-35, pp. 77-92, Aug. 1986.

[2] S. Wu, K. Y. M. Wong and B. Li, "A dynamic call admission policy with precision QoS guarantee using stochastic control for mobile wireless networks," IEEE/ACM Trans. Networking, vol. 10, no. 2, pp. 257-271, 2002.

[3] ISO/IEC 14496-2, "Information technology - coding of audio-visual objects: visual," Committee draft, Oct. 1997.

[4] ITU-T H.263, "Video coding for low bit rate communications," 1998.


[5] 3GPP, "RRC protocol specification," 3G TS 25.331 version 3.12.0, Sept. 2002.

[6] Y.-B. Lin, A. Noerpel, and D. Harasty, "The sub-rating channel assignment strategy for PCS hand-offs," IEEE Trans. Veh. Technol., vol. 45, no. 1, pp. 122-130, 1996.

[7] C. Chou and K. G. Shin, "Analysis of combined adaptive bandwidth allocation and admission control in wireless networks," in Proc. IEEE INFOCOM'02, June 2002.

[8] A. K. Talukdar, B. R. Badrinath, and A. Acharya, "Rate adaptation schemes in networks with mobile hosts," in Proc. ACM/IEEE MOBICOM'98, Oct. 1998.

[9] T. Kwon, I. Choi, Y. Choi, and S. Das, "Near optimal bandwidth adaptation algorithm for adaptive multimedia services in wireless/mobile networks," in Proc. IEEE VTC'99-Fall, Sept. 1999.

[10] Y. Xiao, P. Chen and Y. Wang, "Optimal admission control for multi-class of wireless adaptive multimedia services," IEICE Trans. Commun., vol. E84-B, no. 4, pp. 795-804, April 2001.

[11] Y. Xiao, C. L. P. Chen, and B. Wang, "Bandwidth degradation QoS provisioning for adaptive multimedia in wireless/mobile networks," Computer Commun., vol. 25, pp. 1153-1161, 2002.

[12] G. V. Zaruba, I. Chlamtac, and S. K. Das, "A prioritized real-time wireless call degradation framework for optimal call mix selection," Mobile Networks and Applications, vol. 7, pp. 143-151, April 2002.

[13] T. S. Rappaport, Wireless Communications: Principles and Practice. Englewood Cliffs, NJ: Prentice Hall, 1996.

[14] T. Kwon, Y. Choi, C. Bisdikian, and M. Naghshineh, "QoS provisioning in wireless/mobile multimedia networks using an adaptive framework," Wireless Networks, vol. 9, pp. 51-59, 2003.

[15] S. Ganguly and B. Nath, "QoS provisioning for adaptive services with degradation in cellular network," in Proc. IEEE WCNC'03, New Orleans, Louisiana, March 2003.

[16] H. Tong and T. X. Brown, "Adaptive call admission control under quality of service constraints: a reinforcement learning solution," IEEE J. Select. Areas Commun., vol. 18, no. 2, pp. 209-221, 2000.

[17] P. Marbach, O. Mihatsch, and J. N. Tsitsiklis, "Call admission control and routing in integrated services networks using neuro-dynamic programming," IEEE J. Select. Areas Commun., vol. 18, no. 2, pp. 197-208, 2000.

[18] S. P. Singh and D. P. Bertsekas, "Reinforcement learning for dynamic channel allocation in cellular telephone systems," in M. Mozer et al. (Eds.), Advances in Neural Information Processing Systems 9, pp. 974-980, 1997.

[19] J. Nie and S. Haykin, "A Q-learning based dynamic channel assignment technique for mobile communication systems," IEEE Trans. Veh. Technol., vol. 48, no. 5, pp. 1676-1687, Sept. 1999.

[20] D. Wu, Y. T. Hou, and Y.-Q. Zhang, "Scalable video coding and transport over broadband wireless networks," Proc. of the IEEE, vol. 89, no. 1, pp. 6-20, Jan. 2001.

[21] M. L. Puterman, Markov Decision Processes, Wiley Interscience, New York, USA, 1994.

[22] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279-292, 1992.

[23] S. Mahadevan, "Average reward reinforcement learning: Foundations, algorithms, and empirical results," Machine Learning, vol. 22, pp. 159-196, 1996.

[24] T. K. Das, A. Gosavi, S. Mahadevan and N. Marchalleck, "Solving semi-Markov decision problems using average reward reinforcement learning," Management Science, vol. 45, no. 4, pp. 560-574, 1999.

[25] A. Gosavi, An Algorithm for Solving Semi-Markov Decision Problems Using Reinforcement Learning: Convergence Analysis and Numerical Results, Ph.D. dissertation, University of South Florida, 1999.

[26] A. Gosavi, N. Bandla and T. K. Das, "A reinforcement learning approach to airline seat allocation for multiple fare classes with overbooking," IIE Transactions on Operations Engineering, vol. 34, pp. 729-742, 2002.

[27] E. Altman, Constrained Markov Decision Processes, Chapman and Hall, London, 1999.

[28] Z. Gabor, Z. Kalmar, and C. Szepesvari, "Multi-criteria reinforcement learning," in Proc. Int'l Conf. on Machine Learning, Madison, WI, July 1998.

[29] F. J. Beutler and K. W. Ross, "Optimal policies for controlled Markov chains with a constraint," J. Math. Anal. Appl., vol. 112, pp. 236-252, 1985.

[30] F. J. Beutler and K. W. Ross, "Time-average optimal constrained semi-Markov decision processes," Adv. Appl. Prob., vol. 18, pp. 341-359, 1986.

[31] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.

[32] C. Darken, J. Chang and J. Moody, "Learning rate schedules for faster stochastic gradient search," in Proc. IEEE Workshop on Neural Networks for Signal Processing, Sept. 1992.

[33] J. Abounadi, D. Bertsekas and V. S. Borkar, "Learning algorithms for Markov decision processes with average cost," SIAM J. Contr. Optim., vol. 40, no. 3, pp. 681-698, 2001.

Fig. 7. Normalized average rewards vs. adaptation cost


Fig. 8. Handoff dropping probabilities

Fig. 9. New call blocking probabilities of class 1 calls

Fig. 10. New call blocking probabilities of class 2 calls

Fig. 11. Normalized average bandwidths

Fig. 12. Normalized bandwidth variances

Fig. 13. Normalized average rewards for different average bandwidth requirements


Fig. 14. Normalized average rewards for different bandwidth variance requirements

Fig. 15. Normalized average rewards with non-uniform traffic

Fig. 16. Normalized average rewards with non-uniform traffic using the convex reward function