TRANSCRIPT
Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret
Anima Anandkumar
Electrical Engineering and Computer Science
University of California Irvine
Joint work with Nithin Michael and Ao Tang, Cornell University
USC EE Systems Seminar, 10/08/2010
Introduction: Distributed Medium Access
Constraints of users
Sensing constraints: Sense only part of spectrum at any time
Local information: No centralized control
Lack of coordination: Collisions among users
Unknown channel conditions: lost opportunities
Cooperation and Competition among the Users
Distributed and Adaptive Medium Access Control
Introduction: Cognitive Radio Networks
Two types of users
Primary Users: priority for channel access
Secondary or Cognitive Users: opportunistic access, channel sensing abilities
Key Difference
Availability of Sensing Samples: Leverage for learning channel conditions
Performance Measures for Distributed Mechanisms
Questions
Can sensing samples be used to learn channels with good availability?
Can we design distributed medium access to avoid collisions among the users?
What is the cost of learning and distributed access?
Learning Criteria
Consistency: Learn the good channels with a large number of sensing samples
Low Regret: Minimize access of bad channels
Channel Access Criteria
Orthogonalization: Minimize collisions among users
Maximize total secondary throughput under distributed learning and access
Setup: Distributed Learning and Access
[Figure: C channels with mean availabilities µ1, µ2, . . . , µC]
Slotted tx. with U cognitive users and C > U channels
Channel Availability for Cognitive Users: Mean availability µi for channel i, with µ = [µ1, . . . , µC].
µ unknown to secondary users: learning through sensing samples
No explicit communication/cooperation among cognitive users
Objectives for secondary users
Users ultimately access orthogonal channels with the best availabilities in µ
Max. Total Cognitive System Throughput ≡ Min. Regret
Summary of Results: Three Algorithms
Distributed algorithms based on local information
Performance guarantees under self play
ρPRE: under pre-allocated ranks among cognitive users
  Learning of channels corresponding to assigned ranks
ρRAND: no assigned ranks, but the number of secondary users is known
  Learning channel ranks and adapting to collisions
ρEST: no assigned ranks and an unknown number of secondary users
  Learning channel ranks, adapting to collisions, and estimating the number of users from the number of collisions
Summary of Results (Contd.)
Provable guarantees on the sum regret under the three policies
  Convergence to the optimal configuration (orthogonal occupancy of the best channels)
  Regret grows in the no. of access slots as R(n) ∼ O(log n) for ρPRE and ρRAND
  Regret grows as R(n) ∼ O(f(n) log n), for any f(n) → ∞, for ρEST
Lower bound for any uniformly good policy: also logarithmic in the no. of access slots, R(n) ∼ Ω(log n)
We propose order-optimal distributed learning and allocation policies
A. Anandkumar, N. Michael, and A.K. Tang, “Opportunistic Spectrum Access with Multiple
Users: Learning under Competition,” in Proc. of IEEE INFOCOM, San Diego, USA, Mar. 2010.
A. Anandkumar, N. Michael, A.K. Tang, and A. Swami, “Distributed Learning and Allocation of
Cognitive Users with Logarithmic Regret”, to appear, IEEE JSAC on Cognitive Radio.
Related Work
Multi-armed Bandits
Single cognitive user (Lai & Robbins 85)
Multiple users with centralized allocation (Anantharam et al. 87)
Key Result: Regret R(n) ∼ O(log n) and optimal as n → ∞
Auer et al. 02: order optimality for sample-mean-based policies
Cognitive Medium Access & Learning
Liu et al. 08: Explicit communication among users
Li 08: Q-learning, sensing all channels simultaneously
Liu & Zhao 10: Learning under time-division access; users do not orthogonalize to different channels
Gai et al. 10: Combinatorial bandits, centralized learning under heterogeneous channel availability
Outline
1 Introduction
2 System Model
3 Recap of Bandit Results
4 Proposed Algorithms & Lower Bound
  Learning Under Pre-allocation
  Learning with Random Allocation
  Unknown No. of Cognitive Users
  Lower Bound on Distributed Learning
5 Simulation Results
6 Conclusion
System Model
Primary and Cognitive Networks
Slotted tx. with U cognitive users and C channels
Primary Users: IID tx. in each slot and channel
Channel Availability for Cognitive Users: In each slot, IID with prob. µi for channel i, and µ = [µ1, . . . , µC]
Perfect Sensing: Primary user always detected
Collision Channel: tx. successful only if sole user
Equal rate among secondary users: Throughput ≡ total no. of successful tx.
[Figure: C channels with mean availabilities µ1, µ2, . . . , µC]
Problem Formulation
Distributed Learning Through Sensing Samples
No information exchange/coordination among secondary users
All secondary users employ same policy
Throughput under perfect knowledge of µ and coordination
$$S^*(n; \mu, U) := n \sum_{j=1}^{U} \mu(j^*),$$
where $j^*$ denotes the channel with the $j$th largest entry of µ and n is the no. of access slots
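For example, with the availability vector used later in the simulations, µ = [0.1, 0.2, . . . , 0.9], and U = 4 users, the benchmark throughput is $S^*(n; \mu, 4) = (0.9 + 0.8 + 0.7 + 0.6)\, n = 3n$.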
Regret under learning and distributed access policy ρ
Loss in throughput due to learning and collisions
$$R(n;\mu, U, \rho) := S^*(n;\mu, U) - S(n;\mu, U, \rho)$$
Max. Throughput ≡ Min. Sum Regret
Single Cognitive User: Multi-armed Bandit
[Figure: C channels with mean availabilities µ1, µ2, . . . , µC]
Exploration vs. Exploitation Tradeoff
Exploration: channels with good availability are not missed
Exploitation: obtain good throughput
Explore in the beginning and exploit in the long run
Single Cognitive User: Multi-armed Bandit (Contd.)
$T_i(n)$: no. of slots in which the user selects channel i
$\bar{X}_i(T_i(n))$: sample-mean availability of channel i
$1^*$: channel with the highest mean availability, $\mu(1^*) \ge \mu(j)$, $\forall j$
1-worst channels: channels with lower availability than channel $1^*$
Two Policies based on Sample Mean (Auer et al. 02)
Deterministic Policy: Select the channel with the highest g-statistic:
$$g(i;n) := \bar{X}_i(T_i(n)) + \sqrt{\frac{2 \log n}{T_i(n)}}$$
Randomized Greedy Policy: Select the channel with the highest $\bar{X}_i(T_i(n))$ with prob. $1 - \epsilon_n$, and with prob. $\epsilon_n$ select uniformly among the other channels, where
$$\epsilon_n := \min\left[\frac{\beta}{n},\, 1\right]$$
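To make the two rules concrete, here is a minimal Python sketch (illustrative, not from the talk; it assumes each channel has already been sensed once, so every count is nonzero):

import math
import random

def g_statistic(mean_i, t_i, n):
    # g-statistic: sample mean plus exploration bonus sqrt(2 log n / T_i(n)).
    return mean_i + math.sqrt(2.0 * math.log(n) / t_i)

def deterministic_policy(sample_means, counts, n):
    # Select the channel with the highest g-statistic.
    return max(range(len(sample_means)),
               key=lambda i: g_statistic(sample_means[i], counts[i], n))

def randomized_greedy_policy(sample_means, n, beta):
    # With prob. eps_n = min(beta/n, 1), explore uniformly among the
    # other channels; otherwise exploit the highest sample mean.
    eps_n = min(beta / n, 1.0)
    best = max(range(len(sample_means)), key=lambda i: sample_means[i])
    if random.random() < eps_n:
        others = [i for i in range(len(sample_means)) if i != best]
        return random.choice(others)
    return best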
Policies with Logarithmic Regret
Simplification of Regret $R(n) := S^*(n) - S(n)$
$$R(n) = \sum_{i \in 1\text{-worst}} \Delta(1^*, i)\, \mathbb{E}[T_i(n)],$$
where $\Delta(1^*, i) := \mu(1^*) - \mu(i)$.
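This follows in one line, using $\sum_{i=1}^{C} \mathbb{E}[T_i(n)] = n$ and, in the single-user case, $S(n) = \sum_{i=1}^{C} \mu(i)\, \mathbb{E}[T_i(n)]$:
$$R(n) = n\, \mu(1^*) - \sum_{i=1}^{C} \mu(i)\, \mathbb{E}[T_i(n)] = \sum_{i=1}^{C} \big(\mu(1^*) - \mu(i)\big)\, \mathbb{E}[T_i(n)] = \sum_{i \in 1\text{-worst}} \Delta(1^*, i)\, \mathbb{E}[T_i(n)].$$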
Theorem (Deterministic Policy)
The time spent in any 1-worst channel i under the g-statistic policy, where
$$g(i;n) := \bar{X}_i(T_i(n)) + \sqrt{\frac{2 \log n}{T_i(n)}},$$
satisfies
$$\mathbb{E}[T_i(n)] \le \frac{8 \log n}{\Delta(1^*, i)^2} + 1 + \frac{\pi^2}{3}, \quad \forall\, i \in 1\text{-worst}.$$
Policies with Logarithmic Regret (Contd.)
Theorem (Randomized Policy)
The no. of slots in which a 1-worst channel i is accessed under the randomized greedy policy satisfies
$$\mathbb{E}[T_i(n)] \le \frac{\beta}{C} \log n + \delta, \quad \forall\, i \in 1\text{-worst},$$
when
$$\beta > \max\left[20,\, \frac{4}{\Delta_{\min}^2}\right],$$
where $\Delta_{\min} := \min_{i \in 1\text{-worst}} \Delta(1^*, i)$ is the minimum separation and $\delta > 0$ is a constant.

Regret under the two policies is O(log n) in the no. of access slots n
Lower Bound on Regret
Uniformly good policy ρ
A policy that lets the single cognitive user ultimately settle in the best channel under any channel availabilities µ, spending most of its time there:
$$\mathbb{E}_{\mu}[T_i(n)] = o(n^\alpha), \quad \forall\, \alpha > 0,\ \mu \in (0,1)^C,\ i \in 1\text{-worst}.$$
Satisfied by the two policies of Auer et al.
Theorem (Lower Bound, Lai & Robbins 85)
The time spent in a 1-worst channel under any uniformly good policy satisfies
$$\lim_{n \to \infty} \mathbb{P}\left[\, T_i(n;\rho) \ge \frac{(1-\epsilon) \log n}{D(\mu_i, \mu_{1^*})} \,;\ \mu, \rho \right] = 1, \quad \forall\, \epsilon > 0,\ i \in 1\text{-worst},$$
where $D(\mu_i, \mu_{1^*})$ denotes the Kullback-Leibler divergence between the Bernoulli availability distributions of channels i and $1^*$.
Hence, R(n) = Ω(log n) for any uniformly good policy.
Learning Under Pre-Allocation
If user j is assigned rank $w_j$, select the channel with the $w_j$-th highest sample mean $\bar{X}_{i,j}(T_{i,j}(n))$ with prob. $1 - \epsilon_n$, and with prob. $\epsilon_n$ select uniformly among the other channels, where
$$\epsilon_n := \min\left[\frac{\beta}{n},\, 1\right]$$
Allows for heterogeneous users
Feedback on collisions not required for learning
Regret: the user does not select the channel of its pre-assigned rank
$$\mathbb{E}[T_{i,j}(n)] \le \sum_{t=1}^{n-1} \frac{\epsilon_{t+1}}{C} + \sum_{t=1}^{n-1} \left(1 - \epsilon_{t+1}\right) \mathbb{P}[E_{i,j}(t)], \quad i \ne w_j^*,$$
where $E_{i,j}(n)$ is the error event that the channel with the $w_j$-th highest entry of $\bar{X}_{i,j}(T_{i,j}(n))$ is not the channel $w_j^*$ with the $w_j$-th highest mean availability.
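A minimal Python sketch of the ρPRE selection rule (illustrative names, not from the talk; each channel is assumed to have been sensed at least once):

import random

def rho_pre_select(sample_means, w_j, n, beta):
    # rho-PRE: target the channel whose sample-mean availability is the
    # w_j-th highest; with prob. eps_n explore the remaining channels.
    eps_n = min(beta / n, 1.0)
    order = sorted(range(len(sample_means)),
                   key=lambda i: sample_means[i], reverse=True)
    target = order[w_j - 1]  # channel with the w_j-th highest sample mean
    if random.random() < eps_n:
        others = [i for i in range(len(sample_means)) if i != target]
        return random.choice(others)
    return target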
Regret Under Pre-allocation
Theorem (Regret Under ρPRE Policy)
The no. of slots in which user j accesses a channel other than its pre-allocated channel $w_j^*$ under ρPRE satisfies
$$\mathbb{E}[T_{i,j}(n)] \le \frac{\beta}{C} \log n + \delta, \quad \forall\, i \ne w_j^*,$$
when
$$\beta > \max\left[20,\, \frac{4}{\Delta_{\min}^2}\right],$$
where $\Delta_{\min} := \min_{i \in U\text{-worst}} \Delta(U^*, i)$ is the minimum separation between a U-worst channel and the channel with the U-th highest availability.
Regret R(n) = O(log n) under ρPRE
Distributed Learning and Randomized Allocation ρRAND
User j adaptively chooses its rank $w_j$ based on transmission feedback:
  If there was a collision in the previous slot, draw a new $w_j$ uniformly from 1 to U
  If there was no collision, retain the current $w_j$
Select the channel with the $w_j$-th highest g-statistic:
$$g_j(i;n) := \bar{X}_{i,j}(T_{i,j}(n)) + \sqrt{\frac{2 \log n}{T_{i,j}(n)}}$$
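A minimal, self-contained Python sketch of this rank-update and selection rule (the User container and its field names are illustrative, not from the talk):

import math
import random
from dataclasses import dataclass

@dataclass
class User:
    sample_means: list      # per-channel sample-mean availability
    counts: list            # per-channel no. of sensing samples (>= 1)
    rank: int = 1           # current target rank w_j
    collided: bool = False  # collision feedback from the previous slot

def rho_rand_select(user, n, num_users):
    # On a collision, redraw the rank uniformly from 1..U; else keep it.
    if user.collided:
        user.rank = random.randint(1, num_users)
    # Rank channels by the g-statistic and target the w_j-th highest.
    g = [m + math.sqrt(2.0 * math.log(n) / t)
         for m, t in zip(user.sample_means, user.counts)]
    order = sorted(range(len(g)), key=lambda i: g[i], reverse=True)
    return order[user.rank - 1]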
Upper Bound on Regret
$$R(n) \le \mu(1^*) \left[\, \sum_{j=1}^{U} \sum_{i \in U\text{-worst}} \mathbb{E}[T_{i,j}(n)] + \mathbb{E}[M(n)] \,\right]$$
$\sum_{i \in U\text{-worst}} T_{i,j}(n)$: time spent in U-worst channels by user j
$M(n)$: no. of collisions in the U-best channels
Distributed Learning and Randomized Allocation ρRAND
Theorem
Under the ρRAND policy, $\mathbb{E}[\sum_{i \in U\text{-worst}} T_{i,j}(n)]$ and $\mathbb{E}[M(n)]$ are both $O(\log n)$, and hence the regret is $O(\log n)$, where n is the number of access slots.

Proof steps for $\mathbb{E}[\sum_{i \in U\text{-worst}} T_{i,j}(n)]$:
Bound the time spent in U-worst channels via the decay of the exploration term
Chernoff-Hoeffding bounds for the concentration of the sample-mean channel availabilities (recalled below)
Techniques similar to Auer et al.
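For reference, the Chernoff-Hoeffding bound invoked in the concentration step: for the sample mean $\bar{X}_i(T)$ of T i.i.d. observations in [0, 1] with mean $\mu_i$,
$$\mathbb{P}\left[\, \big|\bar{X}_i(T) - \mu_i\big| \ge \delta \,\right] \le 2 \exp\left(-2 \delta^2 T\right), \quad \forall\, \delta > 0.$$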
Idea of Proof for Regret Under ρRAND
Proof for $\mathbb{E}[M(n)]$: no. of collisions in the U-best channels
Bound the no. of collisions under perfect knowledge of µ by Π(U)
Relate Π(U) to $\mathbb{E}[M(n)]$ under learning of µ

Analysis of Π(U)
Markov chain on the state space of all possible user configurations, whose absorbing state is the orthogonal configuration
Π(U): mean time to absorption under uniform randomization of colliding users
Π(U) < ∞ since the Markov chain has a finite state space
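A small simulation sketch (illustrative, not from the talk) that estimates the mean absorption time Π(U) of this Markov chain, starting from the worst case in which all users collide on the same rank:

import random

def estimate_pi(num_users, num_trials=10000):
    # Mean no. of slots until all users hold distinct ranks, when every
    # user on a contested rank redraws uniformly from 1..num_users.
    total = 0
    for _ in range(num_trials):
        ranks = [1] * num_users  # worst case: all users start colliding
        slots = 0
        while len(set(ranks)) < num_users:
            tally = {}
            for r in ranks:
                tally[r] = tally.get(r, 0) + 1
            ranks = [r if tally[r] == 1 else random.randint(1, num_users)
                     for r in ranks]
            slots += 1
        total += slots
    return total / num_trials

print(estimate_pi(4))  # finite for any fixed U, as the proof requires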
Idea of Proof for Regret Under ρRAND (Contd.)
Bound on E[M(n)] under learning
Good state: all users estimate the order of the top-U channels correctly; otherwise bad state
Bound collisions separately under good and bad states
During a run of good states: Π(U) expected no. of collisions
Total time spent in bad states is O(log n)
Learning with Unknown No. of Users
No. of cognitive users U fixed but unknown to the policy
Horizon length n known to users
Regret bounds as n→∞
ρEST: Joint learning of the channel order and the no. of users
  Maintain an estimate $\hat{U}$ of the no. of users
  Execute ρRAND(n; $\hat{U}$) based on the g-statistic, assuming $\hat{U}$ users
  Update $\hat{U}$ based on feedback (no. of collisions)

Update Rule for $\hat{U}$ under ρEST
  Fixed threshold functions ξ(n; k) for n = 1, 2, . . . and k = 1, . . . , C
  Slow start: initialize $\hat{U} \leftarrow 1$
  If the no. of collisions in the top-$\hat{U}$ channels exceeds the threshold ξ(n; $\hat{U}$), then $\hat{U} \leftarrow \hat{U} + 1$
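A minimal Python sketch of the slow-start update (illustrative names; the threshold shown is just one choice satisfying the growth condition in the theorem below):

import math

def xi(n, k):
    # Example threshold with xi(n; k)/log n -> infinity, e.g. (log n)^2.
    return math.log(max(n, 2)) ** 2

def update_estimate(u_hat, collisions_in_top_u_hat, n):
    # Increment U-hat when the no. of collisions observed in the
    # top-U-hat channels exceeds the threshold xi(n; U-hat).
    if collisions_in_top_u_hat > xi(n, u_hat):
        return u_hat + 1
    return u_hat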
Learning with Unknown No. of Users (Contd.)
Theorem
For any class of threshold functions ξ(n; k) satisfying
$$\lim_{n \to \infty} \frac{\xi(n; k)}{\log n} = \infty, \quad \forall\, k > 1,$$
the regret under the ρEST policy satisfies
$$\limsup_{n \to \infty} \frac{R(n;\mu, U, \rho^{\mathrm{EST}})}{\xi^*(n;U)} < \infty,$$
where
$$\xi^*(n;U) := \max_{k=1,\dots,U} \xi(n;k).$$
Regret is asymptotically (slightly more than) logarithmic under ρEST: e.g., choosing ξ(n; k) = (log n)² yields regret O((log n)²)
Proof Steps
Overestimation: probability that $\hat{U} > U$
Conditional regret: regret under $\hat{U} \le U$

Overestimation Error
Number of collisions experienced while $\hat{U} = U$ is Θ(log n)
Applying a threshold ξ(n) = ω(log n) ensures that $\hat{U}$ is not incremented beyond U

Conditional Regret
Time spent in U-worst channels is O(log n)
Number of collisions is O(ξ∗)
Lower Bound on Regret
Uniformly good policy ρ
A policy under which the users ultimately settle in the orthogonal best channels under any channel availabilities µ: each user j spends most of its time in the U-best channels, and the time spent in any U-worst channel satisfies
$$\mathbb{E}_{\mu}[T_{i,j}(n)] = o(n^\alpha), \quad \forall\, \alpha > 0,\ \mu \in (0,1)^C,\ i \in U\text{-worst}.$$
Satisfied by ρPRE and ρRAND policies
Theorem (Lower Bound for Uniformly Good Policy (Liu & Zhao))
The sum regret satisfies
$$\liminf_{n \to \infty} \frac{R(n;\mu, U, \rho)}{\log n} \ge \sum_{i \in U\text{-worst}} \sum_{j=1}^{U} \frac{\Delta(U^*, i)}{D(\mu_i, \mu_{j^*})}.$$
Order-optimal regret under the ρPRE and ρRAND policies
Simulation Results
[Figure: Normalized regret R(n)/log n vs. no. of slots n, for U = 4 users and C = 9 channels; curves for Random Allocation, Central Allocation, Distributed Lower Bound, and Centralized Lower Bound.]
[Figure: Normalized regret R(n)/log n vs. no. of users U, for C = 9 channels and n = 2500 slots; same four curves.]
Probability of Availability µ = [0.1, 0.2, . . . , 0.9].
Simulation Results
[Figure: No. of runs in which each user obtains the top rank, for U = 4 users, C = 9 channels, n = 2500 slots, under ρRAND.]
[Figure: Regret R(n) vs. no. of slots n under ρPRE for β ∈ {1000, 1200, 2000}, with U = 4 users and C = 9 channels.]
Probability of Availability µ = [0.1, 0.2, . . . , 0.9].
Conclusion
Summary
Considered maximizing the total throughput of cognitive users under unknown channel availabilities and no coordination
Proposed two algorithms which achieve order optimality
  ρPRE policy works under pre-allocated ranks
  ρRAND policy does not require prior information
Proposed the ρEST policy for the case where the no. of users is unknown
  ρEST has asymptotically (slightly more than) logarithmic regret
A. Anandkumar, N. Michael, and A.K. Tang, “Opportunistic Spectrum Access with Multiple
Users: Learning under Competition,” in Proc. of IEEE INFOCOM, San Diego, USA, Mar.
2010.
A. Anandkumar, N. Michael, A.K. Tang, and A. Swami, “Distributed Learning and Allocation of
Cognitive Users with Logarithmic Regret”, available on website.