[IEEE ICC 2012 - IEEE International Conference on Communications - Ottawa, ON, Canada]



Delay-optimal Fair Scheduling and Resource Allocation in Multiuser Wireless Relay Networks

Mohammad Moghaddari, Ekram Hossain, and Long Bao Le

Abstract—We consider fair delay-optimal user selection and power allocation for a relay-based cooperative wireless network. Each user (mobile station) has an uplink queue with heterogeneous packet arrivals and delay requirements. Our system model consists of a base station, a relay station, and multiple users working in a time-division multiplexing (TDM) fashion, where per-user queuing is employed at the relay station to make the analysis of such a system tractable. We model the problem as an infinite-horizon average reward Markov decision problem (MDP) where the control actions are functions of the instantaneous channel state information (CSI) as well as the queue state information (QSI) at the mobile and relay stations. To address the challenges of centralized control and the huge complexity of MDP problems, we introduce a distributed, low-complexity solution. A linear structure is employed which approximates the value function of the associated Bellman equation by a sum of per-node value functions. Our online stochastic value iteration solution converges to the optimal solution almost surely (with probability 1) under some realistic conditions. Simulation results show that the proposed approach outperforms conventional delay-aware user selection and power allocation schemes.

Keywords: Cooperative cellular networks, delay-optimal scheduling, temporal fairness, constrained Markov decision process (CMDP), online stochastic learning algorithm.

I. INTRODUCTION

Previous work in the literature (e.g., [1]-[4]) did not address the problem of delay-optimal scheduling and power allocation for uplink transmission in multiuser wireless relay networks. Our goal is to minimize the end-to-end (e2e) delay for mobile stations (MSs) through user scheduling under a peak power constraint at each node along with individual fairness constraints. We model the problem as an infinite-horizon average reward constrained Markov decision problem (CMDP) where the control actions are functions of the instantaneous CSI as well as the QSI at the users and the relay station (RS). However, there is no simple solution to such an MDP, and brute-force value iteration or policy iteration cannot lead to a viable solution due to the curse of dimensionality [5]. Moreover, the problem is further complicated under distributed implementation requirements, where the control actions must be computed locally based on local CSI and QSI measurements without huge signalling overhead. To this end, we introduce a distributed, low-complexity approach in which a linear structure approximates the value function of the associated Bellman equation by a sum of per-node value functions. Our online stochastic value iteration solution converges to the optimal solution almost surely (with probability 1) under some realistic conditions.

This work was supported by a Strategic Project Grant from NSERC, Canada.

[Figure: block diagram of the uplink system — K mobile stations with queues QS,1, ..., QS,K, a relay station with per-user buffer partitions QR,1, ..., QR,K, the BS, and a cross-layer resource allocation controller driven by the QSI and CSI.]

Fig. 1. System model and cross-layer resource allocation with respect to both MAC layer queue state (QSI) and PHY layer channel state (CSI).

II. SYSTEM MODEL AND ASSUMPTIONS

We consider the uplink transmission in a single cell of a relay-assisted TDMA network as shown in Fig. 1. Each allocated transmission time-slot has two phases, and the main policy of the users is as follows. The transmission from the user to the RS occurs in the first phase and the transmission from the RS to the BS happens in the second phase (half-duplex relaying). The relay does not have packets of its own and only forwards the packets that have been received from the users. We assume K users in the sector; to each one, a separate portion of the relay's common buffer is allocated (QR,1, QR,2, ..., QR,K) as in Fig. 1. These K application streams may have different source arrival rates and delay requirements, which corresponds to a heterogeneous user situation.

Let QS(t) = {QS,k(t), ∀k} and QR(t) = {QR,k(t), ∀k} be the joint QSI of the K-user system and the decomposed RS system, respectively, where QS,k(t) denotes the number of packets in the k-th user queue and QR,k(t) represents the number of packets in the k-th queue portion at the RS (used to store the k-th user's packets) at the beginning of the t-th slot. We denote Q(t) = QS(t) ∪ QR(t) and (QS,k(t), QR,k(t)) as the global and local QSI at time-slot t, respectively. Let H(t) = {Hk,R(t), ∀k} ∪ {HR,BS(t)} be the joint CSI, where Hk,R is the channel gain between the k-th user and the RS, and HR,BS is the channel gain between the RS and the BS. We consider a block fading channel where the small-scale fading coefficient remains quasi-static within a scheduling slot and is i.i.d. between scheduling slots. We denote by H(t) and Hk,R(t) the global and local CSI at time-slot t, respectively.

Data packets arrive according to K random arrival processes, A(t) = {Ak(t), ∀k}, where we assume Ak(t) is i.i.d. over scheduling time-slots with a general distribution fA(a) and E[Ak(t)] = λk. Let yk(t) ∈ {0, 1} denote the time-slot allocation index for the k-th user, i.e., yk(t) = 1

Workshop on Cooperative and Cognitive Mobile Networks

5553

determines that the k-th user is scheduled to transmit in the (t+1)-st time-slot, and yk(t) = 0 otherwise. The maximum achievable data rate in bits per second for the transmission of the k-th user is given by

RS,k = yk log(1 + pk |Hk,R|²), k = 1, 2, . . . , K, (1)

where pk is the transmit power of the k-th user, and the noise power is normalized to 1 for simplicity. Similarly, for the second phase from the RS to the BS we have RR,k = yk log(1 + pR |HR,BS|²), where pR is the transmission power of the RS. We consider a deterministic packet length for each user's application; hence RS,k and RR,k can be converted to packets per second through a proportionality constant. Accordingly, the queue dynamics for user k are given by QS,k(t+1) = min{[QS,k(t) − RS,k(t)]⁺ + Ak(t), LQ}, where x⁺ = max{x, 0} and LQ denotes the maximum buffer size for all users. Furthermore, the queue dynamics of the k-th partition of the RS are given by QR,k(t+1) = min{[QR,k(t) − RR,k(t)]⁺ + RS,k(t), LQ}, where RR,k is the packet rate of the second phase from the RS to the BS.
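The two-hop queue dynamics above can be sketched in code as follows. This is a minimal simulation of one slot for a single user k, assuming a unit proportionality constant between the achievable rate and packets served; the function and variable names mirror the paper's notation but are otherwise illustrative:

```python
import math

L_Q = 10  # maximum buffer size for every queue (packets)

def rate_packets(y, p, h, kappa=1.0):
    # Eq. (1) converted to packets/slot via a proportionality constant kappa
    return int(kappa * y * math.log2(1 + p * abs(h) ** 2))

def queue_step(Q_S, Q_R, A, y, p_k, p_R, H_kR, H_RBS):
    """One-slot update of the k-th MS queue and its RS partition."""
    R_S = rate_packets(y, p_k, H_kR)    # first-hop packet rate
    R_R = rate_packets(y, p_R, H_RBS)   # second-hop packet rate
    # MS queue: serve R_S packets, then admit A arrivals, capped at L_Q
    Q_S_next = min(max(Q_S - R_S, 0) + A, L_Q)
    # RS partition: serve R_R packets, then absorb the first-hop output
    # (the paper's low-load approximation feeds R_S packets in directly)
    Q_R_next = min(max(Q_R - R_R, 0) + R_S, L_Q)
    return Q_S_next, Q_R_next
```

Note how the two queues are in tandem: the departure rate of the MS queue is the arrival rate of the corresponding RS partition.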

We denote by S(t) = (H(t), Q(t)) the global system state at the t-th slot. Given an observed system state realization S, the scheduled transmitter may adjust the transmit power allocation according to a stationary policy. A stationary time-slot and transmit power allocation policy Π maps the system state S to the action space U. Π is called feasible if the associated actions satisfy the optimization constraints (e.g., the average transmit power and user fairness constraints in our problem). It was shown in [6] that S(t+1) depends only on S(t) and the actions at the t-th time-slot; hence, for a given control policy Π, the induced random process {S(t)} is Markovian with the following transition probability:

Pr{S(t+1)|S(t), Π(t)} = Pr{H(t+1)} Pr{Q(t+1)|S(t), Π(S(t))}, (2)

where the queue dynamic transition probability kernel Pr{Q(t+1)|S(t), Π(S(t))} is given by

Pr{Q(t+1)|S(t), Π(S(t))} = Pr{Ak*(t) = QS,k*(t+1) − [QS,k*(t) − RS,k*]⁺} Iψ, (3)

where the condition ψ holds when QS,k(t+1) = QS,k(t) + Ak(t), ∀k ≠ k*, QR,k*(t+1) = QR,k*(t) + RS,k*(t), and QR,k(t+1) = QR,k(t), ∀k ≠ k* are all true. It should be mentioned that a low-load regime and a negligible blocking probability are assumed in (3). Given a unichain policy Π, the induced Markov chain {S(t)} is ergodic and there exists a unique steady state distribution πS [5].

The average number of packets that arrive at the k-th MS is given by λk(1 − Pᵇk), where λk and Pᵇk are the arrival rate and blocking (dropping) probability for MS k. This equals the average number of packets received by the relay's corresponding buffer, as the two buffers are in tandem. Thus, by Little's law, the average time a packet spends in the e2e system is Σk (Q̄S,k + Q̄R,k) / [λk(1 − Pᵇk)]. For a sufficiently small packet dropping rate (sufficiently large buffer and low-load regime), 1 − Pᵇk ≈ 1, and the average e2e delay of the two-hop relay-assisted system is

D̄ = lim_{T→∞} (1/T) Σ_{t=1}^{T} Σ_{k=1}^{K} [QS,k(t) + QR,k(t)] / λk = E^{πS} [ Σ_{k=1}^{K} (QS,k + QR,k) / λk ], (4)

where E^{πS} denotes the expectation with respect to the induced steady state distribution. We assume the average packet arrival rates are such that the users' queues are stable; the stability of the RS's queues follows from this assumption. Therefore, the overflow rate of the system is considered negligible. Similarly, each MS's average power constraint and temporal fairness constraint are given by

P̄k = lim_{T→∞} (1/T) Σ_{t=1}^{T} E^Π[yk(t) pk(t)] = E^{πS}[yk pk] ≤ PU, (5)

Ȳk = lim_{T→∞} (1/T) Σ_{t=1}^{T} E^Π[yk(t)] = E^{πS}[yk] ≥ φk, (6)

where PU is the user's maximum transmit power and φk denotes the minimum relative frequency at which user k should be chosen, with φk ≥ 0 and Σ_{k=1}^{K} φk ≤ 1.

III. CMDP PROBLEM FORMULATION

In this section, we formulate the delay-optimal problem as an infinite-horizon average cost CMDP and discuss the general solution. The goal of the scheduler is to choose an optimal stationary feasible unichain policy Π such that the average end-to-end delay (4) is minimized subject to the average power constraint (5) and the temporal fairness constraint (6) at each user node. This problem is an infinite-horizon average cost CMDP with system state space S = Q × H and action space U = P × Y, where P = {pk, ∀k} is the power allocation action space and Y = {yk, ∀k} is the user selection action space. The transition kernel is given by (2), and the per-stage cost function is d(S, Π(S)) = Σ_{k=1}^{K} (QS,k + QR,k)/λk.

The above CMDP problem can be converted into an unconstrained MDP by Lagrangian theory. We define the Lagrangian as L(Π, γ) = lim_{T→∞} (1/T) Σ_{t=1}^{T} E^Π[L(S(t), Π(S(t)), γ)], where

L(S(t), Π(S(t)), γ) = Σ_{k=1}^{K} [ (QS,k(t) + QR,k(t))/λk + γp,k (yk(t) pk(t) − PU) − γy,k (yk(t) − φk) ], (7)

in which γ = [γp,1, . . . , γp,K, γy,1, . . . , γy,K] is the vector of Lagrange multipliers (LMs), and the corresponding unconstrained MDP is given by

G(γ) = min_Π {L(Π, γ)} = min_Π { lim_{T→∞} (1/T) Σ_{t=1}^{T} E^Π[L(S(t), Π(S(t)), γ)] }, (8)

where G(γ) is the Lagrange dual function. It was shown in [7] that there exists a Lagrange multiplier γ ≥ 0 such that


Π* minimizes L(Π, γ) and the saddle point condition holds. For a given LM vector, the optimizing unichain policy for the unconstrained MDP (8) can be obtained by solving the associated Bellman equation with respect to (θ, {J(S)}) for i = 1, . . . , |S| as:

θ + J(Si) = min_{Π(Si)} { L(Si, Π(Si), γ) + Σ_{Sj} Pr[Sj|Si, Π(Si)] J(Sj) }, (9)

where J(S) is the value function of the MDP, Pr[Sj|Si, Π(Si)] is the transition kernel, which can be obtained from (2), θ = min_Π {L(Π, γ)} is the optimal average cost per stage, and the optimizing policy is Π*.

The Bellman equation in (9) is very complicated to solve due to the curse of dimensionality, and a brute-force solution could not lead to any useful implementation. As was shown in [2], (9) can be simplified into an equivalent form by exploiting the i.i.d. structure of the CSI process H(t).
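The per-stage Lagrangian cost L(S, Π(S), γ) in (7) is a simple sum over the K users; the following minimal sketch makes the three terms explicit (the argument names and all numeric values are illustrative, not taken from the paper):

```python
def per_stage_lagrangian(Q_S, Q_R, y, p, lam, gamma_p, gamma_y, P_U, phi):
    """Per-stage Lagrangian cost of Eq. (7), summed over the K users.

    Q_S, Q_R : queue lengths; y, p : scheduling/power actions;
    lam : arrival rates; gamma_p, gamma_y : Lagrange multipliers.
    """
    K = len(Q_S)
    cost = 0.0
    for k in range(K):
        cost += (Q_S[k] + Q_R[k]) / lam[k]         # delay term of Eq. (4)
        cost += gamma_p[k] * (y[k] * p[k] - P_U)   # power-constraint penalty, Eq. (5)
        cost -= gamma_y[k] * (y[k] - phi[k])       # fairness-constraint term, Eq. (6)
    return cost
```

Averaging this cost over the induced steady-state distribution recovers L(Π, γ), so minimizing it per stage is what the unconstrained MDP (8) does.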

IV. LINEAR VALUE FUNCTION APPROXIMATION SCHEME

The control policy obtained by solving (9) is the same as that obtained by solving the following equivalent Bellman equation [6]:

θ + V(Qi) = min_{Π(Qi)} { L̄(Qi, Π(Qi), γ) + Σ_{Qj} Pr[Qj|Qi, Π(Qi)] V(Qj) }, ∀Qi, (10)

where V(Qi) = E_H[J(Qi, H)|Qi] is the conditional average value function for state Qi, and L̄(Qi, Π(Qi), γ) = E_H[L(Qi, Π(Qi), γ)|Qi] is the conditional per-stage cost. To further simplify the solution to (10), the following linear approximation of the value function can be used:

V(Q) = Σ_{k=1}^{K} Σ_{q=0}^{LQ} [ VS,k(q) I[QS,k = q] + VR,k(q) I[QR,k = q] ], (11)

where {VS,k(q)} and {VR,k(q)}, ∀k = 1, . . . , K, are called the per-node value functions at each MS and at the RS, respectively, and {V(Q)} is the global value function. Compared with the original value function in (9), the dimension of the per-node value functions is much smaller. Therefore, the per-node value functions can only satisfy (9) in some predetermined system queue states, which are referred to as the representative states. Let Qrep = {δk,q, ζk,q | ∀k = 1, . . . , K; q = 1, . . . , LQ} be the representative states, where δk,q and ζk,q correspond to the users' and relay's states, respectively. Without loss of generality, we let VS,1(0) = · · · = VS,K(0) and VR,1(0) = · · · = VR,K(0), and set QI = (0, . . . , 0), denoting that all buffers are empty, as the reference state.
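The linear approximation in (11) amounts to one table lookup per queue followed by a sum, since for any state Q exactly one indicator fires per queue. A minimal sketch (the array shapes and zero initialization are illustrative assumptions):

```python
import numpy as np

K, L_Q = 3, 10
V_S = np.zeros((K, L_Q + 1))   # per-node value functions at the MSs
V_R = np.zeros((K, L_Q + 1))   # per-node value functions at the RS partitions

def global_value(Q_S, Q_R):
    """Approximate global value V(Q) of Eq. (11): sum of per-node values."""
    return sum(V_S[k, Q_S[k]] + V_R[k, Q_R[k]] for k in range(K))
```

The storage cost is 2K(LQ + 1) scalars instead of one value per joint queue state, which is the source of the complexity reduction.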

A. Obtaining the Control Policy Using Per-node Value Functions

Using the approximate value function in (11), we derive a distributed control policy which depends on the local CSI and local QSI as well as the per-node value functions at each node k (∀k = 1, . . . , K) as follows:

Π*(Qi) = arg min_{Π(Qi)} { L̄(Qi, Π(Qi), γ) + Σ_{Qj} Pr[Qj|Qi, Π(Qi)] V(Qj) }
       = arg min_{Π(Qi)} { Σk [ (QiS,k + QiR,k)/λk + Σa fA(a) V(Qin) ] + E_H[ Σk Fk(yk, pk) ] }
       ⇔ arg min_{Π(Qi)} { E_H[ Σk Fk(yk, pk) ] }, (12)

where Qin = [QiS,1 + a, . . . , QiS,K + a, QiR,1, . . . , QiR,K] and Fk(yk, pk) = γp,k yk pk − γy,k yk + Σa fA(a)[VS,k(QiS,k − RS,k + a) − VS,k(QiS,k + a)] + VR,k(QiR,k + RS,k − RR,k) − VR,k(QiR,k + RS,k), where the sign of the fairness term follows from (7).

Lemma 1 (Distributed Control Policy): Given {Vk(q)}, Qi, and H, the following distributed control solves (12) (∀k = 1, . . . , K):
• Power control for the MS-RS link: p*k = arg min_{pk} {Fk(yk, pk)},
• User selection index: y*k = arg min_{yk} {Fk(yk, p*k)}.

As for the second phase, from the RS to the BS, we employ the same policy as in the first phase, i.e., the relay transmits RR,k(t) packets from its k*-th queue to the BS. The rate (number of packets) of transmission can be calculated in a way similar to that in (1). Using Lemma 1, we can obtain the control policy in a distributed manner. To this end, we need to design an online learning algorithm to estimate the per-node value functions at the MSs and RS as well as the LMs.
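Lemma 1 reduces each user's decision to two nested one-dimensional minimizations of Fk. A minimal sketch using a grid search over a discrete power set (the discretized power grid and the closed-form Fk passed in are illustrative assumptions, not the paper's exact implementation):

```python
def local_action(F_k, power_grid):
    """Distributed control of Lemma 1 for one user: first minimize F_k over
    power assuming the user is scheduled, then pick the scheduling index."""
    # power control: p*_k = argmin_p F_k(1, p)
    p_star = min(power_grid, key=lambda p: F_k(1, p))
    # user selection: y*_k = argmin_y F_k(y, p*_k) over y in {0, 1}
    y_star = min((0, 1), key=lambda y: F_k(y, p_star))
    return y_star, p_star
```

Because Fk depends only on the user's own CSI, its own QSI, and the RS partition fed back to it, this minimization runs locally at each MS with no knowledge of the other users' states.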

B. Online Distributed Stochastic Learning Algorithm

Each MS initializes its per-node value functions and LMs, denoted as {V⁰S,k(q)}, {γ⁰p,k}, and {γ⁰y,k}, as well as the per-node value functions for the RS node, denoted as {V⁰R,k(q)}. At the beginning of the t-th frame, the RS broadcasts its corresponding QSI QR,k(t) to each MS node. Based on the local system information (QS,k(t), QR,k(t), Hk(t)) and the per-node value functions {VᵗS,k(q)} and {VᵗR,k(q)}, each MS determines the distributed control actions, namely the user's power allocation p*k and time-slot allocation index y*k. Each MS then updates the per-node value functions {Vᵗ⁺¹S,k(q)} and {Vᵗ⁺¹R,k(q)} as well as the LMs {γᵗ⁺¹p,k, γᵗ⁺¹y,k} according to:

Vᵗ⁺¹S,k(q) = VᵗS,k(q) + εᵗv [q + Fk(y*k, p*k) − VᵗS,k(q)] × I[Q(t) = δk,q], (13)

Vᵗ⁺¹R,k(q) = VᵗR,k(q) + εᵗv [q + Fk(y*k, p*k) − VᵗR,k(q)] × I[Q(t) = ζk,q], (14)

γᵗ⁺¹p,k = [γᵗp,k + εᵗp (pk(t) − PU)]⁺, (15)

γᵗ⁺¹y,k = [γᵗy,k + εᵗy (φk − yk(t))]⁺, (16)

where the sign in (16) follows from the fairness term in (7): the multiplier grows when user k is scheduled less often than its target φk.


M⁻¹ = [ 0  I[Q1 = δ1,1] . . . I[Q1 = δ1,LQ] . . . 0  I[Q1 = δK,1] . . . I[Q1 = δK,LQ] ;
        . . . ;
        0  I[Q|Q| = δ1,1] . . . I[Q|Q| = δ1,LQ] . . . 0  I[Q|Q| = δK,1] . . . I[Q|Q| = δK,LQ] ]ᵀ, (17)

where {εᵗv}, {εᵗp}, {εᵗy} are step-size sequences satisfying: Σ_{t=0}^{∞} εᵗv = ∞, εᵗv > 0, lim_{t→∞} εᵗv = 0; Σ_{t=0}^{∞} εᵗp = ∞, εᵗp > 0, lim_{t→∞} εᵗp = 0; Σ_{t=0}^{∞} εᵗy = ∞, εᵗy > 0, lim_{t→∞} εᵗy = 0; Σ_{t=0}^{∞} [(εᵗv)² + (εᵗp)² + (εᵗy)²] < ∞; lim_{t→∞} εᵗp/εᵗv = 0; and lim_{t→∞} εᵗy/εᵗp = 0.
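One iteration of the stochastic updates can be sketched as follows, for a single user and a single queue level q. The polynomial step-size exponents 0.6, 0.75, and 0.9 are illustrative choices that satisfy the conditions above (divergent sums, square-summable, and vanishing cross-timescale ratios), not values taken from the paper; the fairness multiplier is increased when user k is scheduled less often than its target φk:

```python
def step_sizes(t):
    # sum(eps) = inf, sum(eps^2) < inf, eps_p/eps_v -> 0, eps_y/eps_p -> 0
    return (t + 1) ** -0.6, (t + 1) ** -0.75, (t + 1) ** -0.9

def update(V_S_kq, gamma_p, gamma_y, q_target, F_val, p_used, y_used,
           P_U, phi_k, t, state_matches):
    """One iteration of the value update (13) and LM updates (15)-(16)
    for a single user k and queue level q."""
    eps_v, eps_p, eps_y = step_sizes(t)
    if state_matches:  # indicator I[Q(t) = delta_{k,q}] in Eq. (13)
        V_S_kq += eps_v * (q_target + F_val - V_S_kq)
    gamma_p = max(gamma_p + eps_p * (p_used - P_U), 0.0)   # power multiplier
    gamma_y = max(gamma_y + eps_y * (phi_k - y_used), 0.0)  # fairness multiplier
    return V_S_kq, gamma_p, gamma_y
```

The faster decay of eps_p and eps_y relative to eps_v is what makes the LMs look quasi-static from the value function's point of view, which is the basis of the two-timescale convergence argument below.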

While a brute-force centralized solution would lead to enormous complexity as well as a heavy signalling load to deliver the global CSI and QSI to the controller, the computational complexity of the online stochastic learning algorithm executed at each node grows only linearly with the number of nodes, i.e., O(K). This is because determining the per-node value function at any user depends only on its own queue state and its counterpart at the RS, fed back by the RS at the beginning of each time-slot, and is independent of the states of the other users. Thus, a computational complexity similar to that in [2] and [8] is incurred.

C. Proof of Convergence of the Distributed Online Learning Algorithm

We now establish technical conditions for the almost-sure convergence of the online distributed learning algorithm. Since {εᵗv}, {εᵗp}, {εᵗy} satisfy εᵗp = o(εᵗv) and εᵗy = o(εᵗp), the LM updates and the per-node potential function updates are done simultaneously but over different time scales. During the per-node potential function update (timescale I), we have γᵗ⁺¹p,k − γᵗp,k = O(εᵗp) = o(εᵗv) and γᵗ⁺¹y,k − γᵗy,k = O(εᵗy) = o(εᵗv). Therefore, the LMs appear quasi-static [9] during the per-node value function updates in (13) and (14). For brevity, we only sketch the proof for the users' value function VS,k(q); a similar approach can be taken for VR,k(q). For details, refer to [10].

Define the sequences of matrices {Xt} and {Zt} as Xt−1 = (1 − εᵗ⁻¹v)I + M⁻¹P(Πt)M εᵗ⁻¹v and Zt−1 = (1 − εᵗ⁻¹v)I + M⁻¹P(Πt−1)M εᵗ⁻¹v, where Πt is a unichain system control policy at the t-th frame, P(Πt) is the transition probability matrix of the system states given the unichain system control policy Πt, I is the identity matrix, and M is given in (17), as in [2]. Assume that for every feasible policy in the policy space, there exist some positive integer β and τβ > 0 such that

[Xβ−1 · · · X1]_(r,I) ≥ τβ, [Zβ−1 · · · Z1]_(r,I) ≥ τβ, ∀r, (18)

where [·]_(r,I) denotes the element in the r-th row and I-th column (I corresponds to the reference state QI). Equation (11) can be written as V = MW or W = M⁻¹V, where V = [V(Q1), . . . , V(Q|Q|)]ᵀ and W = [WᵀS WᵀR]ᵀ is the parameter vector. Moreover, WS = [VS,1(0), . . . , VS,1(LQ), . . . , VS,K(0), . . . , VS,K(LQ)]ᵀ and WR = [VR,1(0), . . . , VR,1(LQ), . . . , VR,K(0), . . . , VR,K(LQ)]ᵀ.

It can be shown that the following statements are true. The update of the parameter vector (or per-node potential vector) converges almost surely for any given initial parameter vector W⁰ and LMs γ, i.e., lim_{t→∞} Wt(γ) = W∞(γ). Therefore, the steady state parameter vector W∞ satisfies

θe + W∞(γ) = M⁻¹T(γ, MW∞(γ)), (19)

where θ is a constant, e is a K(LQ + 1) × 1 vector with all elements equal to 1, and the mapping T is defined as T(γ, V) = min_Π [L(γ, Π) + P(Π)V]. Now note that (19) is equivalent to the following Bellman equation on the representative states of Qrep:

θ + V∞k(q) = min_{Π(δk,q)} { L(δk,q, Π(δk,q), γk) + Σ_{Qj} Pr[Qj|δk,q, Π(δk,q)] Σ_{k=1}^{K} V∞k(Qjk) }, ∀δk,q ∈ Qrep. (20)

Hence, (19) and (20) guarantee that the proposed online learning algorithm converges to the best-fit parameter vector (per-node potential) satisfying (11). On the other hand, since the ratio of step sizes satisfies εᵗp/εᵗv, εᵗy/εᵗv → 0 during the LM update (timescale II), the per-node value functions are updated much faster than the LMs. By Corollary 2.1 of [9], we have lim_{t→∞} ‖Vᵗk − V∞k(γᵗ)‖ = 0 with probability 1 (w.p.1). Moreover, for the convergence of the LMs over timescale II, we claim that the iteration on the vector of LMs γ = [γp,1, . . . , γp,K, γy,1, . . . , γy,K]ᵀ converges almost surely to γ* = [γ*p,1, . . . , γ*p,K, γ*y,1, . . . , γ*y,K]ᵀ, which satisfies the power and user fairness constraints in (5) and (6). Under the same conditions as mentioned in Section IV-B, we have (γᵗ, Wᵗ) → (γ*, W∞(γ*)) w.p.1, where (γ*, W∞(γ*)) satisfies θe + W∞(γ*) = M⁻¹T(γ*, MW∞(γ*)) and the power and user fairness constraints in (5) and (6).

V. PERFORMANCE EVALUATION

By simulations, we compare our proposed distributed online per-node value function learning algorithm against two reference benchmarks. One is the traditional round-robin (RR) scheme, a non-opportunistic scheduling policy that schedules users in a predetermined order: at time slot t, the (t mod K + 1)-th user is chosen. The other is the online rate equivalent (ORA) scheme in [8], which converts a CMDP problem corresponding to a one-hop multiuser system into K subproblems, each corresponding to a single-user system. In this approach, the minimum required rate of each user is computed and user scheduling is done in a greedy manner.

We assume the total bandwidth is 1 MHz, the packet arrival process at each user node is Poisson with average arrival rate λ = 20 packets/s, and the packet size in each time-slot is deterministic. Arrivals are generated in an i.i.d. manner across slots. We consider a Rayleigh fading channel model for the first hop, where each user's channel state Hk,R is drawn from the probability density function expressed as fH(h) =


[Figure: average e2e delay (time-slots) versus average 1st-hop SNR (dB) for the proposed scheme, the 1-hop rate equivalent scheme, and the round-robin scheme.]

Fig. 2. Average e2e delay versus 1st-hop SNR for a 10-user system.

[Figure: average e2e delay per user (time-slots) versus the number of users for the proposed scheme, the 1-hop rate equivalent scheme, and the round-robin scheme.]

Fig. 3. Average e2e delay per user versus the number of users for K = 10 users, LQ = 10, and SNR = 6 dB.

(h/α²) exp(−h²/(2α²)), h ≥ 0, with α = −4 dB. We consider a 10-user system with a maximum buffer size of LQ = 10 for each queue in the system and a maximum power of PU = 1 W.

The average e2e delay of the system versus the average SNR of the first hop is illustrated in Fig. 2. It can be observed that the proposed distributed algorithm achieves a significant performance gain in average delay over the ORA scheme. As expected, the e2e delay of the RR scheme does not vary with the channel gains.

Fig. 3 compares the delay performance of the three approaches for different numbers of users. The average transmit SNR for each user is 6 dB. As the number of users served by the BS grows, the average rate of e2e delay increase per user is much higher for the RR and ORA schemes. This confirms the convergence of the proposed scheme to an optimal solution even in a relatively large state space.

Fig. 4 indicates the long-term time fraction allocations of all 10 users under the various scheduling policies for the problem. For each user, the rightmost bar shows the minimum time fraction requirement (φk). The remaining three bars represent the time fraction allocated to this user by the three policies evaluated here.

Fig. 5 shows the convergence property of the approximate MDP approach using distributed stochastic learning. We plot the average per-node value functions of the users versus the scheduling slot index at a transmit SNR = 10 dB. It can be seen that the distributed algorithm converges quite fast

Fig. 4. Time fraction allocation (temporal fairness) in a 10-user system.

[Figure: per-node value functions VS,1(1) through VS,1(10) versus the number of iterations, each settling within roughly 200 iterations.]

Fig. 5. Convergence property for the distributed online learning algorithm for K = 10 users, LQ = 10, and SNR = 10 dB.

and after 200 iterations the values are extremely close to the final converged results. Similar results are observed for the per-node value functions at the RS. The average delay per user corresponding to the average per-node potential functions at the 200-th iteration is 5.85, which is smaller than 5.95 for the ORA scheme and 6.92 for the RR benchmark.

REFERENCES

[1] M. J. Neely, "Optimal energy and delay tradeoffs for multiuser wireless downlinks," IEEE Trans. on Inform. Theory, vol. 53, no. 9, pp. 3095-3113, Sept. 2007.
[2] Y. Cui and V. K. N. Lau, "Distributive stochastic learning for delay-optimal OFDMA power and subband allocation," IEEE Trans. on Signal Process., vol. 58, no. 9, pp. 4848-4858, Sept. 2010.
[3] R. Knopp and P. A. Humblet, "Information capacity and power control in single cell multiuser communications," in Proc. IEEE International Conference on Communications, pp. 331-335, June 1995.
[4] X. Liu, E. K. P. Chong, and N. B. Shroff, "Opportunistic transmission scheduling with resource-sharing constraints in wireless networks," IEEE J. Sel. Areas Commun., vol. 19, no. 10, pp. 2053-2064, Oct. 2001.
[5] D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs, NJ, 1987.
[6] V. K. N. Lau and Y. Cui, "Delay-optimal power and subcarrier allocation for OFDMA systems via stochastic approximation," IEEE Trans. on Wireless Commun., vol. 9, no. 1, pp. 227-233, Jan. 2010.
[7] V. S. Borkar, "An actor-critic algorithm for constrained Markov decision processes," Systems & Control Letters, vol. 54, pp. 207-213, 2005.
[8] N. Salodkar, A. Karandikar, and V. S. Borkar, "A stable online algorithm for energy-efficient multiuser scheduling," IEEE Trans. on Mobile Computing, vol. 9, no. 10, pp. 1391-1406, Oct. 2011.
[9] V. S. Borkar, "Stochastic approximation with two time scales," Systems & Control Letters, vol. 29, pp. 291-294, 1997.
[10] http://www.ee.umanitoba.ca/∼ekram/convergence-proof.pdf
