Delay-optimal Fair Scheduling and Resource Allocation in Multiuser Wireless Relay Networks
Mohammad Moghaddari, Ekram Hossain, and Long Bao Le
Abstract—We consider fair delay-optimal user selection and power allocation for a relay-based cooperative wireless network. Each user (mobile station) has an uplink queue with heterogeneous packet arrivals and delay requirements. Our system model consists of a base station, a relay station, and multiple users working in a time-division multiplexing (TDM) fashion, where per-user queuing is employed at the relay station to make the analysis of such a system tractable. We model the problem as an infinite-horizon average reward Markov decision problem (MDP) where the control actions are functions of the instantaneous channel state information (CSI) as well as the queue state information (QSI) at the mobile and relay stations. To address the challenge of centralized control and the huge complexity of MDP problems, we introduce a distributive and low-complexity solution. A linear structure is employed which approximates the value function of the associated Bellman equation by the sum of per-node value functions. Our online stochastic value iteration solution converges to the optimal solution almost surely (with probability 1) under some realistic conditions. Simulation results show that the proposed approach outperforms conventional delay-aware user selection and power allocation schemes.
Keywords: Cooperative cellular networks, delay-optimal
scheduling, temporal fairness, constrained Markov decision
process (CMDP), online stochastic learning algorithm.
I. INTRODUCTION
Previous work in the literature (e.g., in [1]-[4]) did not
address the problem of delay-optimal scheduling and power
allocation for uplink transmission in multiuser wireless relay
networks. Our goal is to minimize the end-to-end (e2e) delay
for mobile stations (MSs) through user scheduling under peak
power constraints at each node along with individual fairness
constraints. We model the problem as an infinite-horizon av-
erage reward constrained Markov decision problem (CMDP)
where the control actions are functions of the instantaneous
CSI as well as the QSI at the users and relay station (RS).
However, there is no simple solution to such an MDP, and
brute-force value iteration or policy iteration cannot yield a
viable solution due to the curse of dimensionality [5].
Moreover, the problem is further complicated under
distributed implementation requirements where the control
actions are computed locally based on the local CSI and QSI
measurements without huge signalling overhead. To this end,
we introduce a distributed and low-complexity approach where
a linear structure is employed which approximates the value
function of the associated Bellman equation by the sum of
per-node value functions. Our online stochastic value iteration
solution converges to the optimal solution almost surely (with
probability 1) under some realistic conditions.

This work was supported by a Strategic Project Grant from NSERC, Canada.
Fig. 1. System model and cross-layer resource allocation with respect to both MAC-layer queue state (QSI) and PHY-layer channel state (CSI).
II. SYSTEM MODEL AND ASSUMPTIONS
We consider the uplink transmission in a single cell of
a relay-assisted TDMA network as shown in Fig. 1. Each
allocated transmission time-slot has two phases, and the
transmission policy is as follows. The transmission from
the user to RS occurs in the first phase and the transmission
from RS to BS happens in the second phase (half-duplex
relaying). The relay does not have packets of its own and
only forwards the packets that have been received from the
users. We assume $K$ users in the sector; a separate portion
of the relay's common buffer is allocated to each of them
$(Q_{R,1}, Q_{R,2}, \ldots, Q_{R,K})$, as shown in Fig. 1. These $K$ application
streams may have different source arrival rates and delay
requirements, which corresponds to a heterogeneous user
situation.
Let $\mathbf{Q}_S(t) = \{Q_{S,k}(t), \forall k\}$ and $\mathbf{Q}_R(t) = \{Q_{R,k}(t), \forall k\}$ be the joint QSI of the $K$-user and decomposed RS system, respectively, where $Q_{S,k}(t)$ denotes the number of packets in the $k$-th user queue and $Q_{R,k}(t)$ represents the number of packets in the $k$-th queue portion at the RS (used to store the $k$-th user's packets) at the beginning of the $t$-th slot. We denote $\mathbf{Q}(t) = \mathbf{Q}_S(t) \cup \mathbf{Q}_R(t)$ and $(Q_{S,k}(t), Q_{R,k}(t))$ as the global and local QSI at time-slot $t$, respectively. Let $\mathbf{H}(t) = \{H_{k,R}(t), \forall k\} \cup \{H_{R,BS}(t)\}$ be the joint CSI, where $H_{k,R}$ is the channel gain between the $k$-th user and the RS, and $H_{R,BS}$ is the channel gain between the RS and the BS.
We consider the block fading channel where the small scale
fading coefficient remains quasi-static within a scheduling slot
and i.i.d. between scheduling slots. We denote by $\mathbf{H}(t)$ and $H_{k,R}(t)$ the global and local CSI at time-slot $t$, respectively. Data packets arrive according to $K$ random arrival processes, $\mathbf{A}(t) = \{A_k(t), \forall k\}$, where we assume $A_k(t)$ is i.i.d. over scheduling time-slots following a general distribution $f_A(a)$ with $\mathbb{E}[A_k(t)] = \lambda_k$. Let $y_k(t) \in \{0,1\}$ denote the time-slot allocation index for the $k$-th user, i.e., $y_k(t) = 1$
indicates that the $k$-th user is scheduled to transmit in the
$(t+1)$-st time-slot, and $y_k(t) = 0$ otherwise. The maximum
achievable data rate in bits per second for the transmission of
the $k$-th user is given by
$$R_{S,k} = y_k \log\big(1 + p_k |H_{k,R}|^2\big), \quad k = 1, 2, \ldots, K, \qquad (1)$$
where $p_k$ is the transmit power of the $k$-th user, and
the noise power is normalized to 1 for simplicity. Similarly,
for the second phase from the RS to the BS we have $R_{R,k} = y_k \log(1 + p_R |H_{R,BS}|^2)$, where $p_R$ is the transmission power of the RS. We consider a deterministic packet length for each user's application; hence $R_{S,k}$ and $R_{R,k}$ can be converted to packets-per-second through a proportionality constant. Accordingly, the queue dynamics for user $k$ are given by $Q_{S,k}(t+1) = \min\{[Q_{S,k}(t) - R_{S,k}(t)]^+ + A_k(t), L_Q\}$, where $x^+ = \max\{x, 0\}$ and $L_Q$ denotes the maximum buffer size for all users. Furthermore, the queue dynamics of the $k$-th partition of the RS are given by $Q_{R,k}(t+1) = \min\{[Q_{R,k}(t) - R_{R,k}(t)]^+ + R_{S,k}(t), L_Q\}$, where $R_{R,k}$ is the packet rate of the second phase from the RS to the BS.
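To make the dynamics concrete, the following Python sketch simulates one scheduling slot of the coupled user and relay queues; the Poisson arrivals, the random seed, and the unit rate-to-packet proportionality constant are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L_Q = 10, 10        # number of users and per-queue buffer size (illustrative)
lam = 0.5              # assumed mean packet arrivals per slot

Q_S = np.zeros(K, dtype=int)   # user-side queues Q_{S,k}
Q_R = np.zeros(K, dtype=int)   # per-user partitions Q_{R,k} of the relay buffer

def slot_update(y, p, p_R, H_kR, H_RBS):
    """One two-phase slot: users -> RS (phase 1), RS -> BS (phase 2)."""
    global Q_S, Q_R
    A = rng.poisson(lam, K)
    # Rates from Eq. (1); rate-to-packet proportionality constant taken as 1.
    R_S = np.floor(y * np.log2(1.0 + p * np.abs(H_kR) ** 2)).astype(int)
    R_R = np.floor(y * np.log2(1.0 + p_R * np.abs(H_RBS) ** 2)).astype(int)
    # Relay partition receives R_S packets (low-load, negligible blocking).
    Q_R = np.minimum(np.maximum(Q_R - R_R, 0) + R_S, L_Q)
    Q_S = np.minimum(np.maximum(Q_S - R_S, 0) + A, L_Q)
```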
We denote by $S(t) = (\mathbf{H}(t), \mathbf{Q}(t))$ the global system state
at the t-th slot. Given an observed system state realization
(S), the scheduled transmitter may adjust the transmit power
allocation according to a stationary policy. A stationary time-
slot and transmit power allocation policy $\Pi$ maps the system state $S$ to the action space $U$. $\Pi$ is called feasible if the
associated actions satisfy the optimization constraints (e.g.,
average transmit power and users’ fairness constraints in our
problem). It was shown in [6] that S(t + 1) only depends
on S(t) and actions at t-th time-slot, and hence the induced
random process {S(t)} for a given control policy Π, is
Markovian with the following transition probability:
$$\Pr\{S(t+1) \mid S(t), \Pi(t)\} = \Pr\{\mathbf{H}(t+1)\} \Pr\{\mathbf{Q}(t+1) \mid S(t), \Pi(S(t))\}, \qquad (2)$$

where the queue dynamic transition probability kernel $\Pr\{\mathbf{Q}(t+1) \mid S(t), \Pi(S(t))\}$ is given by

$$\Pr\{\mathbf{Q}(t+1) \mid S(t), \Pi(S(t))\} = \Pr\big\{A_{k^*}(t) = Q_{S,k^*}(t+1) - [Q_{S,k^*}(t) - R_{S,k^*}]^+\big\} \, I_\psi, \qquad (3)$$

where the condition $\psi$ is true when $Q_{S,k}(t+1) = Q_{S,k}(t) + A_k(t), \forall k \neq k^*$, $Q_{R,k^*}(t+1) = Q_{R,k^*}(t) + R_{S,k^*}(t)$, and $Q_{R,k}(t+1) = Q_{R,k}(t), \forall k \neq k^*$ all hold. It should
be noted that a low-load regime and a negligible blocking
probability are assumed in (3). Given a unichain
policy Π, the induced Markov chain {S(t)} is ergodic and
there exists a unique steady state distribution πS [5].
The average number of packets that arrive at the $k$-th MS is given by $\lambda_k(1 - P_k^b)$, where $\lambda_k$ and $P_k^b$ are the arrival rate and the blocking (dropping) probability for MS$_k$, respectively. This is the same as the average number of packets received by the relay's corresponding buffer, as the two buffers are in tandem. Thus, by Little's law, the average time a packet spends in the e2e system is $\sum_k \frac{Q_{S,k}+Q_{R,k}}{\lambda_k(1-P_k^b)}$. For a sufficiently small packet dropping rate (sufficiently large buffer and low-load regime), $1 - P_k^b \approx 1$, and the average e2e delay of the two-hop relay-assisted system is
$$\bar{D} = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Pi}\!\left[\sum_{k=1}^{K} \frac{Q_{S,k}(t) + Q_{R,k}(t)}{\lambda_k}\right] = \mathbb{E}^{\pi_S}\!\left[\sum_{k=1}^{K} \frac{Q_{S,k} + Q_{R,k}}{\lambda_k}\right], \qquad (4)$$
where $\mathbb{E}^{\pi_S}$ denotes the expectation with respect to the induced
steady state distribution. We assume the average arrival rate of
the packets is such that the users’ queues are considered stable.
From this assumption the stability of the RS’s queues follows.
Therefore, the overflow rate of the system is considered
negligible. Similarly, each MS’s average power constraint and
temporal fairness constraint are given by
$$\bar{P}_k = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Pi}[y_k(t) p_k(t)] = \mathbb{E}^{\pi_S}[y_k p_k] \le P_U, \qquad (5)$$

$$\bar{Y}_k = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Pi}[y_k(t)] = \mathbb{E}^{\pi_S}[y_k] \ge \phi_k, \qquad (6)$$

where $P_U$ is the user's maximum transmit power and $\phi_k$ denotes the minimum relative frequency at which user $k$ should be chosen, with $\phi_k \ge 0$ and $\sum_{k=1}^{K} \phi_k \le 1$.
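For intuition, the long-run averages in (4)-(6) can be estimated along a simulated trajectory by incremental time-averaging; the monitor below is a hypothetical helper, not part of the paper's scheme.

```python
class ConstraintMonitor:
    """Running time-averages of the delay (4), power (5), and fairness (6) metrics."""

    def __init__(self, K, lam):
        self.t = 0
        self.lam = lam               # per-user arrival rates lambda_k
        self.D = 0.0                 # running average of sum_k (Q_S+Q_R)/lambda_k
        self.P = [0.0] * K           # running averages of y_k * p_k
        self.Y = [0.0] * K           # running averages of y_k

    def update(self, Q_S, Q_R, y, p):
        self.t += 1
        a = 1.0 / self.t             # incremental-averaging weight
        d = sum((qs + qr) / l for qs, qr, l in zip(Q_S, Q_R, self.lam))
        self.D += a * (d - self.D)
        for k in range(len(y)):
            self.P[k] += a * (y[k] * p[k] - self.P[k])
            self.Y[k] += a * (y[k] - self.Y[k])
```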
III. CMDP PROBLEM FORMULATION
In this section, we shall formulate the delay-optimal prob-
lem as an infinite-horizon average cost CMDP and discuss
the general solution. The goal of the scheduler is to choose
an optimal stationary feasible unichain policy Π such that
the average end-to-end delay (4) is minimized subject to the
average power constraint (5) and temporal fairness constraint
(6) at each user node. This problem is an infinite-horizon
average cost CMDP with system state space $\mathcal{S} = \mathcal{Q} \times \mathcal{H}$ and action space $\mathcal{U} = \mathcal{P} \times \mathcal{Y}$, where $\mathcal{P} = \{p_k, \forall k\}$ is the power allocation action space and $\mathcal{Y} = \{y_k, \forall k\}$ is the user selection action space. The transition kernel is given by (2), and the per-stage cost function is $d(S, \Pi(S)) = \sum_{k=1}^{K} (Q_{S,k} + Q_{R,k})/\lambda_k$.
The above CMDP problem can be converted into an uncon-
strained MDP by Lagrange theory. We define the Lagrangian
as $L(\Pi, \gamma) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Pi}[L(S(t), \Pi(S(t)), \gamma)]$, where

$$L(S(t), \Pi(S(t)), \gamma) = \sum_{k=1}^{K} \Big[ \frac{Q_{S,k}(t) + Q_{R,k}(t)}{\lambda_k} + \gamma_{p,k}\big(y_k(t) p_k(t) - P_U\big) - \gamma_{y,k}\big(y_k(t) - \phi_k\big) \Big], \qquad (7)$$

in which $\gamma = [\gamma_{p,1}, \ldots, \gamma_{p,K}, \gamma_{y,1}, \ldots, \gamma_{y,K}]$ is the Lagrange multiplier (LM) vector, and the corresponding unconstrained MDP is given by

$$G(\gamma) = \min_{\Pi} \{L(\Pi, \gamma)\} = \min_{\Pi} \left\{ \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Pi}[L(S(t), \Pi(S(t)), \gamma)] \right\}, \qquad (8)$$

where $G(\gamma)$ gives the Lagrange dual function. It was shown in [7] that there exists a Lagrange multiplier $\gamma \ge 0$ such that
$\Pi^*$ minimizes $L(\Pi, \gamma)$ and the saddle point condition holds.
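For concreteness, the per-stage Lagrangian in (7) is a simple sum over users; a minimal sketch, assuming the queue states, actions, and LMs are available as arrays:

```python
def per_stage_lagrangian(Q_S, Q_R, y, p, lam, gamma_p, gamma_y, P_U, phi):
    """Evaluate L(S, Pi(S), gamma) of Eq. (7) for one slot."""
    total = 0.0
    for k in range(len(y)):
        total += (Q_S[k] + Q_R[k]) / lam[k]          # delay (per-stage cost) term
        total += gamma_p[k] * (y[k] * p[k] - P_U)    # average-power constraint term
        total -= gamma_y[k] * (y[k] - phi[k])        # temporal-fairness constraint term
    return total
```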
For a given LM vector, the optimizing unichain policy for the
unconstrained MDP (8) can be obtained by solving the asso-
ciated Bellman equation w.r.t. $(\theta, \{J(S)\})$ for $i = 1, \ldots, |\mathcal{S}|$ as:

$$\theta + J(S_i) = \min_{\Pi(S_i)} \Big\{ L(S_i, \Pi(S_i), \gamma) + \sum_{S_j} \Pr[S_j \mid S_i, \Pi(S_i)] J(S_j) \Big\}, \qquad (9)$$

where $J(S)$ is the value function of the MDP, $\Pr[S_j \mid S_i, \Pi(S_i)]$ is the transition kernel, which can be obtained from (2), $\theta = \min_{\Pi}\{L(\Pi, \gamma)\}$ is the optimal average cost per stage, and the optimizing policy is $\Pi^*$.

The Bellman equation in (9) is very complicated to solve
due to the curse of dimensionality, and a brute-force solution
cannot lead to any useful implementation. As was shown in
[2], (9) can be simplified into an equivalent form by exploiting
the i.i.d. structure of the CSI process H(t).
IV. LINEAR VALUE FUNCTION APPROXIMATION SCHEME
The control policy obtained by solving (9) is the same as
that obtained by solving the following equivalent Bellman
equation [6]:
$$\theta + V(\mathbf{Q}_i) = \min_{\Pi(\mathbf{Q}_i)} \Big\{ \bar{L}\big(\mathbf{Q}_i, \Pi(\mathbf{Q}_i), \gamma\big) + \sum_{\mathbf{Q}_j} \Pr[\mathbf{Q}_j \mid \mathbf{Q}_i, \Pi(\mathbf{Q}_i)] V(\mathbf{Q}_j) \Big\}, \quad \forall \mathbf{Q}_i, \qquad (10)$$

where $V(\mathbf{Q}_i) = \mathbb{E}_{\mathbf{H}}[J(\mathbf{Q}_i, \mathbf{H}) \mid \mathbf{Q}_i]$ is the conditional average value function for state $\mathbf{Q}_i$, and $\bar{L}(\mathbf{Q}_i, \Pi(\mathbf{Q}_i), \gamma) = \mathbb{E}_{\mathbf{H}}[L(\mathbf{Q}_i, \Pi(\mathbf{Q}_i), \gamma) \mid \mathbf{Q}_i]$ is the conditional per-stage cost. To further simplify the solution to (10), the following linear approximation of the value function can be used:

$$V(\mathbf{Q}) = \sum_{k=1}^{K} \sum_{q=0}^{L_Q} \big[ V_{S,k}(q) I[Q_{S,k} = q] + V_{R,k}(q) I[Q_{R,k} = q] \big], \qquad (11)$$
where $\{V_{S,k}(q)\}$ and $\{V_{R,k}(q)\}$, $\forall k = 1, \ldots, K$, are called the per-node value functions at each MS and the RS, respectively, and $V(\mathbf{Q})$ is the global value function. Compared with the original value function in (9), the dimension of the per-node value functions is much smaller. Therefore, the per-node value functions can only satisfy (9) at some predetermined system queue states, which are referred to as the representative states. Let $\mathcal{Q}_{\text{rep}} = \{\delta_{k,q}, \zeta_{k,q} \mid \forall k = 1, \ldots, K; \; q = 1, \ldots, L_Q\}$ be the representative states, where $\delta_{k,q}$ and $\zeta_{k,q}$ correspond to the users' and relay's states, respectively. Without loss of generality, we let $V_{S,1}(0) = \cdots = V_{S,K}(0)$ and $V_{R,1}(0) = \cdots = V_{R,K}(0)$, and set $\mathbf{Q}_I = (0, \ldots, 0)$, denoting that all buffers are empty, as the reference state.
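Computationally, the decomposition in (11) reduces the global value function to one table lookup per node; a minimal sketch, assuming each per-node value function is stored as an array of length $L_Q + 1$:

```python
def approx_value(Q_S, Q_R, V_S, V_R):
    """Linear value-function approximation of Eq. (11).

    V_S[k][q] and V_R[k][q] hold the per-node value functions; the global
    value is the sum of the entries selected by the current queue lengths.
    """
    return sum(V_S[k][Q_S[k]] + V_R[k][Q_R[k]] for k in range(len(Q_S)))
```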
A. Obtaining the Control Policy Using Per-node Value Functions
Using the approximate value function in (11), we shall
derive a distributed control policy which depends on the local
CSI and local QSI as well as the per-node value functions at
each node $k$ ($\forall k = 1, \ldots, K$) as follows:

$$\begin{aligned}
\Pi^*(\mathbf{Q}_i) &= \arg\min_{\Pi(\mathbf{Q}_i)} \Big\{ \bar{L}\big(\mathbf{Q}_i, \Pi(\mathbf{Q}_i), \gamma\big) + \sum_{\mathbf{Q}_j} \Pr[\mathbf{Q}_j \mid \mathbf{Q}_i, \Pi(\mathbf{Q}_i)] V(\mathbf{Q}_j) \Big\} \\
&= \arg\min_{\Pi(\mathbf{Q}_i)} \Big\{ \sum_k \Big[ \frac{Q_{S,k}^i + Q_{R,k}^i}{\lambda_k} + \sum_a f_A(a) V(\mathbf{Q}_n^i) \Big] + \mathbb{E}_{\mathbf{H}} \Big[ \sum_k F_k(y_k, p_k) \Big] \Big\} \\
&\Leftrightarrow \arg\min_{\Pi(\mathbf{Q}_i)} \Big\{ \mathbb{E}_{\mathbf{H}} \Big[ \sum_k F_k(y_k, p_k) \Big] \Big\},
\end{aligned} \qquad (12)$$

where $\mathbf{Q}_n^i = [Q_{S,1}^i + a, \ldots, Q_{S,K}^i + a, Q_{R,1}^i, \ldots, Q_{R,K}^i]$ and
$F_k(y_k, p_k) = \gamma_{p,k} y_k p_k + \gamma_{y,k} y_k + \sum_a f_A(a) \big[ V_{S,k}(Q_{S,k}^i - R_{S,k} + a) - V_{S,k}(Q_{S,k}^i + a) \big] + V_{R,k}(Q_{R,k}^i + R_{S,k} - R_{R,k}) - V_{R,k}(Q_{R,k}^i + R_{S,k})$.
Lemma 1 (Distributed Control Policy): Given $\{V_k(q)\}$, $\mathbf{Q}_i$, and $\mathbf{H}$, the following distributive control solves (12) ($\forall k = 1, \ldots, K$):

• Power control for the MS-RS link: $p_k^* = \arg\min_{p_k} \{F_k(y_k, p_k)\}$;

• User selection index: $y_k^* = \arg\min_{y_k} \{F_k(y_k, p_k^*)\}$.
As for the second phase, from RS to BS, we employ the
same policy as in the first phase, i.e., the relay transmits $R_{R,k}(t)$ packets from its $k^*$-th queue to the BS. The rate (number of
packets) of transmission can be calculated in a way similar to
that in (1). Using Lemma 1, we can obtain the control policy in
a distributive manner. To this end, we need to design an online
learning algorithm to estimate the per-node value functions in
MSs and RS as well as the LMs.
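A minimal sketch of the per-node action selection in Lemma 1, assuming the transmit power is searched over a hypothetical discrete grid and that a helper F_k evaluates the per-user cost defined after (12); both are illustrative choices rather than the paper's implementation:

```python
import numpy as np

def select_action(F_k, power_grid):
    """Distributed control of Lemma 1 for a single user k.

    F_k(y, p) evaluates the per-user cost term after Eq. (12);
    power_grid is an assumed discretization of the feasible powers.
    """
    # Power control for the MS-RS link: minimize F_k over p with y = 1.
    costs = [F_k(1, p) for p in power_grid]
    p_star = power_grid[int(np.argmin(costs))]
    # User selection index: transmit only if it beats staying silent.
    y_star = 1 if F_k(1, p_star) < F_k(0, 0.0) else 0
    return y_star, p_star
```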
B. Online Distributed Stochastic Learning Algorithm
Each MS initializes its per-node value functions and LMs, denoted as $\{V_{S,k}^0(q)\}$, $\{\gamma_{p,k}^0\}$, and $\{\gamma_{y,k}^0\}$, as well as the per-node value functions for the RS node, denoted as $\{V_{R,k}^0(q)\}$. At the beginning of the $t$-th frame, the RS node broadcasts its corresponding QSI $Q_{R,k}(t)$ to each MS node. Based on the local system information $(Q_{S,k}(t), Q_{R,k}(t), H_k(t))$ and the per-node value functions $\{V_{S,k}^t(q)\}$ and $\{V_{R,k}^t(q)\}$, each MS determines the distributive control actions, including the user's power allocation $p_k^*$ and the user's time-slot allocation index $y_k^*$. Each MS then updates the per-node value functions $\{V_{S,k}^{t+1}(q)\}$ and $\{V_{R,k}^{t+1}(q)\}$ as well as the LMs $\{\gamma_{p,k}^{t+1}, \gamma_{y,k}^{t+1}\}$ according to:
$$V_{S,k}^{t+1}(q) = V_{S,k}^{t}(q) + \varepsilon_v^t \big[ q + F_k(y_k^*, p_k^*) - V_{S,k}^{t}(q) \big] \, I[\mathbf{Q}(t) = \delta_{k,q}], \qquad (13)$$

$$V_{R,k}^{t+1}(q) = V_{R,k}^{t}(q) + \varepsilon_v^t \big[ q + F_k(y_k^*, p_k^*) - V_{R,k}^{t}(q) \big] \, I[\mathbf{Q}(t) = \zeta_{k,q}], \qquad (14)$$

$$\gamma_{p,k}^{t+1} = \big[ \gamma_{p,k}^t + \varepsilon_p^t (p_k(t) - P_U) \big]^+, \qquad (15)$$

$$\gamma_{y,k}^{t+1} = \big[ \gamma_{y,k}^t + \varepsilon_y^t (y_k(t) - \phi_k) \big]^+, \qquad (16)$$
$$\mathbf{M}^{-1} = \begin{bmatrix}
0 & I[\mathbf{Q}_1 = \delta_{1,1}] & \cdots & I[\mathbf{Q}_1 = \delta_{1,L_Q}] & \cdots & 0 & I[\mathbf{Q}_1 = \delta_{K,1}] & \cdots & I[\mathbf{Q}_1 = \delta_{K,L_Q}] \\
\vdots & \vdots & & \vdots & & \vdots & \vdots & & \vdots \\
0 & I[\mathbf{Q}_{|\mathcal{Q}|} = \delta_{1,1}] & \cdots & I[\mathbf{Q}_{|\mathcal{Q}|} = \delta_{1,L_Q}] & \cdots & 0 & I[\mathbf{Q}_{|\mathcal{Q}|} = \delta_{K,1}] & \cdots & I[\mathbf{Q}_{|\mathcal{Q}|} = \delta_{K,L_Q}]
\end{bmatrix}^T. \qquad (17)$$
where $\{\varepsilon_v^t\}$, $\{\varepsilon_p^t\}$, and $\{\varepsilon_y^t\}$ are step-size sequences satisfying: $\sum_{t=0}^{\infty} \varepsilon_v^t = \infty$, $\varepsilon_v^t > 0$, $\lim_{t\to\infty} \varepsilon_v^t = 0$; $\sum_{t=0}^{\infty} \varepsilon_p^t = \infty$, $\varepsilon_p^t > 0$, $\lim_{t\to\infty} \varepsilon_p^t = 0$; $\sum_{t=0}^{\infty} \varepsilon_y^t = \infty$, $\varepsilon_y^t > 0$, $\lim_{t\to\infty} \varepsilon_y^t = 0$; $\sum_{t=0}^{\infty} [(\varepsilon_v^t)^2 + (\varepsilon_p^t)^2 + (\varepsilon_y^t)^2] < \infty$; $\lim_{t\to\infty} \varepsilon_p^t/\varepsilon_v^t = 0$; and $\lim_{t\to\infty} \varepsilon_y^t/\varepsilon_p^t = 0$.
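The updates (13)-(16) are standard stochastic-approximation iterations; the sketch below runs one iteration for user k, with polynomially decaying step sizes (an assumed choice, e.g. $\varepsilon_v^t = t^{-0.6}$, $\varepsilon_p^t = t^{-0.75}$, $\varepsilon_y^t = t^{-0.9}$) that satisfy the conditions above.

```python
def learn_step(t, k, q_S, q_R, F_val, y, p, V_S, V_R, g_p, g_y,
               P_U, phi_k, at_delta, at_zeta):
    """One iteration of updates (13)-(16) for user k at frame t >= 1.

    at_delta / at_zeta flag whether Q(t) hits the representative state
    delta_{k,q} / zeta_{k,q}; F_val is F_k(y*, p*) from Eq. (12);
    p is the power actually used in slot t (0 if not scheduled).
    """
    eps_v, eps_p, eps_y = t ** -0.6, t ** -0.75, t ** -0.9
    if at_delta:   # Eq. (13): per-node value update at the MS
        V_S[k][q_S] += eps_v * (q_S + F_val - V_S[k][q_S])
    if at_zeta:    # Eq. (14): per-node value update for the RS partition
        V_R[k][q_R] += eps_v * (q_R + F_val - V_R[k][q_R])
    # Eqs. (15)-(16): LM updates on the slower timescales, projected onto [0, inf)
    g_p[k] = max(g_p[k] + eps_p * (p - P_U), 0.0)
    g_y[k] = max(g_y[k] + eps_y * (y - phi_k), 0.0)
```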
While a brute-force centralized solution will lead to enor-
mous complexity as well as signalling load to deliver the
global CSI and QSI to the controller, the computational com-
plexity of the online stochastic learning algorithm executed
at each node grows only linearly with the number of nodes,
i.e., $O(K)$. This is because the per-node value function at any
user depends only on its own queue state and its counterpart
at the RS, which is fed back by the RS at the beginning of each
time-slot, and is independent of the states of the other users.
Thus, similar computational complexity as in [2] and [8] is
incurred.
C. Proof of Convergence of the Distributed Online Learning Algorithm
We shall establish technical conditions for the almost-
sure convergence of the online distributive learning algorithm.
Since $\{\varepsilon_v^t\}$, $\{\varepsilon_p^t\}$, $\{\varepsilon_y^t\}$ satisfy $\varepsilon_p^t = o(\varepsilon_v^t)$ and $\varepsilon_y^t = o(\varepsilon_p^t)$, the LM updates and the per-node potential function updates are done simultaneously but over two different timescales. During the per-node potential function update (timescale I), we have $\gamma_{p,k}^{t+1} - \gamma_{p,k}^t = O(\varepsilon_p^t) = o(\varepsilon_v^t)$ and $\gamma_{y,k}^{t+1} - \gamma_{y,k}^t = O(\varepsilon_y^t) = o(\varepsilon_v^t)$. Therefore, the LMs appear quasi-static [9] during the per-node value function updates in (13) and (14). For brevity, we only sketch the proof for the users' value function $V_{S,k}(q)$; a similar approach can be taken for $V_{R,k}(q)$. For details, refer to [10].
Define the sequences of matrices $\{\mathbf{X}_t\}$ and $\{\mathbf{Z}_t\}$ as $\mathbf{X}_{t-1} = (1 - \varepsilon_v^{t-1})\mathbf{I} + \mathbf{M}^{-1}\mathbf{P}(\Pi_t)\mathbf{M}\varepsilon_v^{t-1}$ and $\mathbf{Z}_{t-1} = (1 - \varepsilon_v^{t-1})\mathbf{I} + \mathbf{M}^{-1}\mathbf{P}(\Pi_{t-1})\mathbf{M}\varepsilon_v^{t-1}$, where $\Pi_t$ is a unichain system control policy at the $t$-th frame, $\mathbf{P}(\Pi_t)$ is the transition probability matrix of the system states given the unichain system control policy $\Pi_t$, $\mathbf{I}$ is the identity matrix, and $\mathbf{M}$ is given in (17), as in [2]. Assume that, for all feasible policies in the policy space, there exist some positive integer $\beta$ and $\tau_\beta > 0$ such that

$$\big[\mathbf{X}_{\beta-1} \cdots \mathbf{X}_1\big]_{(r,I)} \ge \tau_\beta, \quad \big[\mathbf{Z}_{\beta-1} \cdots \mathbf{Z}_1\big]_{(r,I)} \ge \tau_\beta, \quad \forall r, \qquad (18)$$

where $[\cdot]_{(r,I)}$ denotes the element in the $r$-th row and $I$-th column ($I$ corresponds to the reference state $\mathbf{Q}_I$). Equation (11) can be written as $\mathbf{V} = \mathbf{M}\mathbf{W}$ or $\mathbf{W} = \mathbf{M}^{-1}\mathbf{V}$, where $\mathbf{V} = [V(\mathbf{Q}_1), \ldots, V(\mathbf{Q}_{|\mathcal{Q}|})]^T$ and $\mathbf{W} = [\mathbf{W}_S^T \; \mathbf{W}_R^T]^T$ is the parameter vector. Moreover, $\mathbf{W}_S = [V_{S,1}(0), \ldots, V_{S,1}(L_Q), \ldots, V_{S,K}(0), \ldots, V_{S,K}(L_Q)]^T$ and $\mathbf{W}_R = [V_{R,1}(0), \ldots, V_{R,1}(L_Q), \ldots, V_{R,K}(0), \ldots, V_{R,K}(L_Q)]^T$.
It can be shown that the following statements are true. The update of the parameter vector (or per-node potential vector) converges almost surely for any given initial parameter vector $\mathbf{W}^0$ and LMs $\gamma$, i.e., $\lim_{t\to\infty} \mathbf{W}^t(\gamma) = \mathbf{W}^\infty(\gamma)$. Therefore, the steady-state parameter vector $\mathbf{W}^\infty$ satisfies

$$\theta \mathbf{e} + \mathbf{W}^\infty(\gamma) = \mathbf{M}^{-1}\mathbf{T}(\gamma, \mathbf{M}\mathbf{W}^\infty(\gamma)), \qquad (19)$$

where $\theta$ is a constant, $\mathbf{e}$ is a $K(L_Q+1) \times 1$ vector with all elements equal to 1, and the mapping matrix $\mathbf{T}$ is defined as $\mathbf{T}(\gamma, \mathbf{V}) = \min_{\Pi}[\mathbf{L}(\gamma, \Pi) + \mathbf{P}(\Pi)\mathbf{V}]$. Now note that (19) is equivalent to the following Bellman equation on the representative states of $\mathcal{Q}_{\text{rep}}$:

$$\theta + V_k^\infty(q) = \min_{\Pi(\delta_{k,q})} \Big\{ L(\delta_{k,q}, \Pi(\delta_{k,q}), \gamma_k) + \sum_{\mathbf{Q}_j} \Pr[\mathbf{Q}_j \mid \delta_{k,q}, \Pi(\delta_{k,q})] \sum_{k=1}^{K} V_k^\infty(\mathbf{Q}_k^j) \Big\}, \quad \forall \delta_{k,q} \in \mathcal{Q}_{\text{rep}}. \qquad (20)$$
Hence, (19) and (20) basically guarantee that the proposed on-
line learning algorithm will converge to the best fit parameter
vector (per-node potential) satisfying (11). On the other hand,
since the ratio of the step sizes satisfies $\varepsilon_p^t/\varepsilon_v^t, \varepsilon_y^t/\varepsilon_v^t \to 0$, during the LM update (timescale II) the per-node value functions are updated much faster than the LMs. By Corollary 2.1 of [9], we have $\lim_{t\to\infty} \|\mathbf{V}_k^t - V_k^\infty(\gamma^t)\| = 0$ with probability 1 (w.p.1). Moreover, for the convergence of the LMs over timescale II, we claim that the iteration on the vector of LMs $\gamma = [\gamma_{p,1}, \ldots, \gamma_{p,K}, \gamma_{y,1}, \ldots, \gamma_{y,K}]^T$ converges almost surely to $\gamma^* = [\gamma_{p,1}^*, \ldots, \gamma_{p,K}^*, \gamma_{y,1}^*, \ldots, \gamma_{y,K}^*]^T$, which satisfies the power and users' fairness constraints in (5) and (6). Under the same conditions as mentioned in Section IV-B, we have $(\gamma^t, \mathbf{W}^t) \to (\gamma^*, \mathbf{W}^\infty(\gamma^*))$ w.p.1, where $(\gamma^*, \mathbf{W}^\infty(\gamma^*))$ satisfies $\theta\mathbf{e} + \mathbf{W}^\infty(\gamma^*) = \mathbf{M}^{-1}\mathbf{T}(\gamma^*, \mathbf{M}\mathbf{W}^\infty(\gamma^*))$ and the power and users' fairness constraints in (5) and (6).
V. PERFORMANCE EVALUATION
By simulations, we shall compare our proposed distributed
online per-node value function learning algorithm to two
reference benchmarks. One is the traditional round-robin (RR)
scheme which is a non-opportunistic scheduling policy that
schedules users in a predetermined order: at time slot $t$, the
$(t \bmod K + 1)$-th user is chosen. The other is the online rate
equivalent (ORA) scheme in [8], which converts a CMDP
problem corresponding to a 1-hop multiuser system into $K$
subproblems, each corresponding to a single-user system. In this
approach the minimum required rate of each user is computed
and user scheduling is done in a greedy manner.
We assume the total bandwidth is 1 MHz, the packet
arrival at each user node is Poisson with average arrival
rate λ = 20 packets/s and deterministic packet size in each
time-slot. Arrivals are generated in an i.i.d. manner across
slots. We consider Rayleigh fading channel model for the
first hop, where each user’s channel state Hk,R is selected
from the probability density function expressed as $f_H(h) = \frac{h}{\alpha^2} \exp\left(-\frac{h^2}{2\alpha^2}\right)$, $h \ge 0$, with $\alpha = -4$ dB. We consider a 10-user system with a maximum buffer size of $L_Q = 10$ for each queue in the system and a maximum power of $P_U = 1$ W.

Fig. 2. Average e2e delay versus 1st-hop SNR for a 10-user system.

Fig. 3. Average e2e delay per user versus the number of users for $K = 10$ users, $L_Q = 10$, and SNR = 6 dB.
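The channel and arrival models above are straightforward to reproduce; a sketch under the stated assumptions (Rayleigh first-hop fading, Poisson arrivals), where the mapping of $\alpha = -4$ dB to a linear amplitude scale and the per-slot arrival rate are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
K, L_Q, P_U = 10, 10, 1.0      # paper's setup: 10 users, buffer size 10, 1 W
alpha = 10 ** (-4 / 20)        # alpha = -4 dB read as an amplitude scale (assumed)
lam_slot = 0.02                # 20 packets/s mapped to an assumed per-slot rate

def draw_channels():
    """Sample Rayleigh first-hop gains H_{k,R} with pdf (h/a^2)exp(-h^2/(2a^2))."""
    return rng.rayleigh(scale=alpha, size=K)

def draw_arrivals():
    """Sample Poisson packet arrivals A_k(t), one count per user per slot."""
    return rng.poisson(lam_slot, size=K)
```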
The average e2e delay of the system versus average SNR
of the first-hop is illustrated in Fig. 2. It can be observed that
the proposed distributive algorithm achieves a significant
performance gain in average delay over the ORA scheme.
As expected, the e2e delay of the RR scheme does not vary
with the channel gains.
Fig. 3 compares the delay performance of the three approaches
for different numbers of users. The average transmit
SNR for each user is 6 dB. As the number of users connected to
the BS grows, the per-user e2e delay increases at a much
higher rate for the RR and ORA schemes. This confirms the
convergence of the proposed scheme to an optimal solution
even in a relatively large state space.
Fig. 4 indicates the long-term time fraction allocations of all
10 users under the various scheduling policies for the problem.
For each user, the rightmost bar shows the minimum time
fraction requirement ($\phi_k$). The remaining three bars represent the time fraction allocated to this user under the three policies evaluated here.

Fig. 4. Time fraction allocation (temporal fairness) in a 10-user system.
Fig. 5 shows the convergence property of the approximate
MDP approach using distributed stochastic learning. We plot
the average per-node value functions of the users versus the
scheduling slot index at a transmit SNR = 10 dB. It can
be seen that the distributed algorithm converges quite fast, and after 200 iterations the values are extremely close to the final converged results. Similar results are observed for the per-node value functions at the RS. The average delay per user corresponding to the average per-node potential functions at the 200-th iteration is 5.85 time-slots, which is smaller than the 5.95 of the ORA scheme and the 6.92 of the RR benchmark.

Fig. 5. Convergence property of the distributed online learning algorithm for $K = 10$ users, $L_Q = 10$, and SNR = 10 dB.
REFERENCES
[1] M. J. Neely, "Optimal energy and delay tradeoffs for multiuser wireless downlinks," IEEE Trans. Inform. Theory, vol. 53, no. 9, pp. 3095-3113, Sept. 2007.
[2] Y. Cui and V. K. N. Lau, "Distributive stochastic learning for delay-optimal OFDMA power and subband allocation," IEEE Trans. Signal Process., vol. 58, no. 9, pp. 4848-4858, Sept. 2010.
[3] R. Knopp and P. A. Humblet, "Information capacity and power control in single cell multiuser communications," Proc. IEEE International Conference on Communications, pp. 331-335, June 1995.
[4] X. Liu, E. K. P. Chong, and N. B. Shroff, "Opportunistic transmission scheduling with resource-sharing constraints in wireless networks," IEEE J. Sel. Areas Commun., vol. 19, no. 10, pp. 2053-2064, Oct. 2001.
[5] D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs, NJ, 1987.
[6] V. K. N. Lau and Y. Cui, "Delay-optimal power and subcarrier allocation for OFDMA systems via stochastic approximation," IEEE Trans. Wireless Commun., vol. 9, no. 1, pp. 227-233, Jan. 2010.
[7] V. S. Borkar, "An actor-critic algorithm for constrained Markov decision processes," Systems & Control Letters, vol. 54, pp. 207-213, 2005.
[8] N. Salodkar, A. Karandikar, and V. S. Borkar, "A stable online algorithm for energy-efficient multiuser scheduling," IEEE Trans. Mobile Computing, vol. 9, no. 10, pp. 1391-1406, Oct. 2011.
[9] V. S. Borkar, "Stochastic approximation with two time scales," Systems & Control Letters, vol. 29, pp. 291-294, 1997.
[10] http://www.ee.umanitoba.ca/~ekram/convergence-proof.pdf