
Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret

Anima Anandkumar

Electrical Engineering and Computer Science

University of California Irvine

Joint work with Nithin Michael and Ao Tang, Cornell University

USC EE Systems Seminar

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 1 / 32

Introduction: Distributed Medium Access


Constraints of users

Sensing constraints: Sense only part of spectrum at any time

Local information: No centralized control

Lack of coordination: Collisions among users

Unknown channel conditions: Lost opportunities

Cooperation and Competition among the Users

Distributed and Adaptive Medium Access Control

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 2 / 32

Introduction: Cognitive Radio Networks

Two types of users

Primary Users: Priority for channel access

Secondary or Cognitive Users: Opportunistic access; channel sensing abilities


Key Difference

Availability of Sensing Samples: Leverage for learning channel conditions

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 3 / 32

Performance Measures for Distributed Mechanisms

Questions

Can sensing samples be used to learn channels with good availability?

Can we design distributed medium access to avoid collisions among the users?

What is the cost of learning and distributed access?

Learning Criteria

Consistency: Learn the good channels with a large number of sensing samples

Low Regret: Minimize access of bad channels

Channel Access Criteria

Orthogonalization: Minimize collisions among users

Maximize total secondary throughput under distributed learning and access

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 4 / 32

Setup: Distributed Learning and Access

[Figure: C channels with mean availabilities µ1, µ2, . . . , µC]

Slotted tx. with U cognitive users and C > U channels

Channel Availability for Cognitive Users: Mean availability µi for channel i and µ = [µ1, . . . , µC].

µ unknown to secondary users: learning through sensing samples

No explicit communication/cooperation among cognitive users

Objectives for secondary users

Users ultimately access orthogonal channels with the best availabilities in µ

Max. Total Cognitive System Throughput ≡ Min. Regret

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 5 / 32

Summary of Results: Three Algorithms

Distributed algorithms based on local information

Performance guarantees under self play

ρPRE: under pre-allocated ranks among cognitive users
  - Learning of channels corresponding to assigned ranks

ρRAND: no assigned ranks but number of secondary users known
  - Learning channel ranks and adapting to collisions

ρEST: no assigned ranks and unknown number of secondary users
  - Learning channel ranks, adapting to collisions, and estimating the number of users based on the number of collisions

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 6 / 32

Summary of Results (Contd.)

Provable guarantees on sum regret under the three policies
  - Convergence to optimal configuration (orthogonal occupancy in the best channels)
  - Regret grows in no. of access slots as R(n) ∼ O(log n) for ρPRE and ρRAND
  - Regret grows in no. of access slots as R(n) ∼ O(f(n) log n), for any f(n) → ∞, for ρEST

Lower bound for any uniformly good policy: also logarithmic in no. of access slots, R(n) ∼ Ω(log n)

We propose order-optimal distributed learning and allocation policies

A. Anandkumar, N. Michael, and A.K. Tang, “Opportunistic Spectrum Access with Multiple Users: Learning under Competition,” in Proc. of IEEE INFOCOM, (San Diego, USA), Mar. 2010.

A. Anandkumar, N. Michael, A.K. Tang, and A. Swami, “Distributed Learning and Allocation of Cognitive Users with Logarithmic Regret,” to appear, IEEE JSAC on Cognitive Radio.

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 7 / 32

Related Work

Multi-armed Bandits

Single cognitive user (Lai & Robbins 85)

Multiple users with centralized allocation (Anantharam et al. 87)

Key Result: Regret R(n) ∼ O(log n) and optimal as n → ∞

Auer et al. 02: order optimality for sample-mean policies

Cognitive Medium Access & Learning

Liu et al. 08: Explicit communication among users

Li 08: Q-learning, sensing all channels simultaneously

Liu & Zhao 10: Learning under time-division access: users do not orthogonalize to different channels

Gai et al. 10: Combinatorial bandits, centralized learning under heterogeneous channel availability

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 8 / 32

Outline

1 Introduction

2 System Model

3 Recap of Bandit Results

4 Proposed Algorithms & Lower Bound
  - Learning Under Pre-allocation
  - Learning with Random Allocation
  - Unknown No. of Cognitive Users
  - Lower Bound on Distributed Learning

5 Simulation Results

6 Conclusion

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 9 / 32

System Model

Primary and Cognitive Networks

Slotted tx. with U cognitive users and C channels

Primary Users: IID tx. in each slot and channel
  - Channel Availability for Cognitive Users: In each slot, IID with prob. µi for channel i and µ = [µ1, . . . , µC].

Perfect Sensing: Primary user always detected

Collision Channel: tx. successful only if sole user

Equal rate among secondary users: Throughput ≡ total no. of successful tx.

[Figure: C channels with mean availabilities µ1, µ2, . . . , µC]

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 10 / 32

Problem Formulation

Distributed Learning Through Sensing Samples

No information exchange/coordination among secondary users

All secondary users employ same policy

Throughput under perfect knowledge of µ and coordination

S∗(n; µ, U) := n ∑_{j=1}^{U} µ(j∗),

where µ(j∗) is the jth-largest entry of µ and n is the no. of access slots

Regret under learning and distributed access policy ρ

Loss in throughput due to learning and collisions

R(n;µ, U, ρ) := S∗(n;µ, U)− S(n;µ, U, ρ)

Max. Throughput ≡ Min. Sum Regret

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 11 / 32
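
To make the benchmark concrete, here is a minimal Python sketch (not part of the talk) of S∗(n; µ, U) and the sum regret, using the availability vector µ = [0.1, . . . , 0.9] from the simulation slides; the achieved-throughput value is a placeholder for whatever a policy ρ actually delivers.

```python
# Minimal sketch (not from the talk): benchmark throughput S*(n; mu, U) and sum regret.
mu = [0.1 * k for k in range(1, 10)]    # channel availabilities, C = 9 (as in the simulations)
U, n = 4, 2500                          # no. of cognitive users and access slots

# S*(n; mu, U) = n * (sum of the U largest availabilities): perfect knowledge + coordination
s_star = n * sum(sorted(mu, reverse=True)[:U])

# S(n; mu, U, rho): throughput actually achieved by some policy rho -- here a
# hypothetical count of successful transmissions, for illustration only.
s_achieved = 7200.0
regret = s_star - s_achieved            # R(n; mu, U, rho) = S* - S
print(f"S* = {s_star:.1f}, regret R(n) = {regret:.1f}")
```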

Single Cognitive User: Multi-armed Bandit

[Figure: C channels with mean availabilities µ1, µ2, . . . , µC−1, µC]

Exploration vs. Exploitation Tradeoff

Exploration: channels with good availability are not missed

Exploitation: obtain good throughput

Explore in the beginning and exploit in the long run

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 13 / 32

Single Cognitive User: Multi-armed Bandit (Contd.)

Ti(n): no. of slots where user j selects channel i

Xi(Ti(n)): sample mean availability of channel i acc. to user j

1∗: channel with the highest mean availability µ(1∗) ≥ µj, ∀j.

1-worst channel: channels which have lower availability than 1∗

Two Policies based on Sample Mean (Auer et al. 02)

Deterministic Policy: Select channel with highest g-statistic:

g(i; n) := X̄i(Ti(n)) + √( 2 log n / Ti(n) )

Randomized Greedy Policy: Select channel with highest X̄i(Ti(n)) with prob. 1 − εn and with prob. εn select uniformly among the other channels, where

εn := min(β/n, 1)

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 14 / 32
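
A minimal Python sketch of the two single-user policies under i.i.d. Bernoulli(µi) availability observations; the availability vector, β, and the horizon are illustrative choices, not values from the talk.

```python
import math, random

mu = [0.1 * k for k in range(1, 10)]          # true (unknown) channel availabilities
C, n_slots, beta = len(mu), 5000, 50.0

def run_g_statistic():
    """Deterministic policy: pick the channel with the highest
    g(i; n) = sample mean + sqrt(2 log n / T_i(n))."""
    T, X, reward = [0] * C, [0.0] * C, 0.0     # counts, sample means, total throughput
    for t in range(1, n_slots + 1):
        if t <= C:                             # sense every channel once to initialize
            i = t - 1
        else:
            i = max(range(C), key=lambda k: X[k] + math.sqrt(2 * math.log(t) / T[k]))
        avail = 1.0 if random.random() < mu[i] else 0.0
        T[i] += 1
        X[i] += (avail - X[i]) / T[i]          # running sample mean
        reward += avail
    return reward

def run_randomized_greedy():
    """Randomized greedy: exploit the best sample mean w.p. 1 - eps_n, otherwise
    pick one of the other channels uniformly, with eps_n = min(beta/n, 1)."""
    T, X, reward = [0] * C, [0.0] * C, 0.0
    for t in range(1, n_slots + 1):
        eps = min(beta / t, 1.0)
        best = max(range(C), key=lambda k: X[k])
        i = random.choice([k for k in range(C) if k != best]) if random.random() < eps else best
        avail = 1.0 if random.random() < mu[i] else 0.0
        T[i] += 1
        X[i] += (avail - X[i]) / T[i]
        reward += avail
    return reward

print(run_g_statistic(), run_randomized_greedy())
```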

Policies with Logarithmic Regret

Simplification of Regret R(n) := S∗(n)− S(n)

R(n) = ∑_{i ∈ 1-worst} ∆(1∗, i) E[Ti(n)],

where ∆(1∗, i) := µ(1∗) − µ(i).

Theorem (Deterministic Policy)

Time spent in any 1-worst channel i under the g-statistic policy, where g(i; n) := X̄i(Ti(n)) + √( 2 log n / Ti(n) ), satisfies

E[Ti(n)] ≤ (8 log n)/∆(1∗, i)² + 1 + π²/3, ∀i = 1, . . . , C, i ∈ 1-worst.

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 15 / 32

Policies with Logarithmic Regret (Contd.)

Theorem (Randomized Policy)

The no. of slots a channel i ≠ 1∗ is accessed under the randomized greedy policy satisfies

E[Ti(n)] ≤ (β/C) log n + δ, ∀i = 1, . . . , C, i ∈ 1-worst,

when β > max(20, 4/∆min²), where ∆min := min_{i ∈ 1-worst} ∆(1∗, i) is the minimum separation.

Regret under the two policies is O(log n), where n is the no. of access slots

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 16 / 32

Lower Bound on Regret

Uniformly good policy ρ

A policy that enables the single cognitive user to ultimately settle down in the best channel under any channel availabilities µ, i.e., the user spends most of the time in the best channel:

Eµ[Ti(n)] = o(n^α), ∀α > 0, µ ∈ (0, 1)^C, i ∈ 1-worst.

Satisfied by the two policies of Auer et al.

Theorem (Lower Bound, Lai & Robbins 85)

Time spent in a 1-worst channel under any uniformly good policy satisfies, for every ε > 0,

lim_{n→∞} P[ Ti(n; ρ) ≥ (1 − ε) log n / D(µi, µ1∗) ; µ, ρ ] = 1, ∀i ∈ 1-worst,

where D(µi, µ1∗) is the Kullback-Leibler divergence between the availability distributions of channel i and the best channel 1∗.

Hence, R(n) = Ω(log n) for any uniformly good policy.

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 17 / 32
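
For Bernoulli availabilities, D(µi, µ1∗) is the binary Kullback-Leibler divergence, so the constant in front of log n can be evaluated numerically. A small sketch, assuming the µ vector from the simulation slides:

```python
import math

def kl_bernoulli(p, q):
    """Binary KL divergence D(p || q) between Bernoulli availability distributions."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu = [0.1 * k for k in range(1, 10)]   # availabilities from the simulation slides
best = max(mu)                          # mu(1*)

# Lai & Robbins: each 1-worst channel i is sensed at least ~ (1 - eps) log n / D(mu_i, mu_1*)
# times, so the single-user regret constant is the sum of Delta(1*, i) / D(mu_i, mu_1*).
constant = sum((best - m) / kl_bernoulli(m, best) for m in mu if m < best)
print(f"single-user lower bound: R(n) >~ {constant:.2f} log n")
```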

Learning Under Pre-Allocation

If user j is assigned rank wj, select the channel with the wj-th highest X̄i,j(Ti,j(n)) with prob. 1 − εn, and with prob. εn select uniformly among the other channels, where

εn := min(β/n, 1)

Allows for heterogeneous users

Feedback on collisions not required for learning

Regret arises when the user does not select the channel of its pre-assigned rank:

E[Ti,j(n)] ≤ ∑_{t=1}^{n−1} εt+1/C + ∑_{t=1}^{n−1} (1 − εt+1) P[Ei,j(t)], i ≠ w∗j,

where Ei,j(t) is the error event that the wj-th highest entry among the sample means X̄i,j(Ti,j(t)) does not correspond to µ(w∗j)

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 20 / 32
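
A minimal sketch of the ρPRE selection rule for one user, assuming a pre-assigned rank wj and Bernoulli sensing observations; the rank, β, and horizon below are illustrative.

```python
import random

mu = [0.1 * k for k in range(1, 10)]            # true availabilities (unknown to the user)
C, n_slots, beta = len(mu), 5000, 50.0
rank = 2                                        # pre-allocated rank w_j (1 = best), chosen for illustration

T, X = [0] * C, [0.0] * C                       # per-channel sensing counts and sample means
for t in range(1, n_slots + 1):
    eps = min(beta / t, 1.0)                    # eps_n = min(beta/n, 1)
    target = sorted(range(C), key=lambda k: X[k], reverse=True)[rank - 1]   # w_j-th highest sample mean
    if random.random() < eps:                   # explore: pick one of the other channels uniformly
        i = random.choice([k for k in range(C) if k != target])
    else:                                       # exploit: channel of the pre-assigned rank
        i = target
    avail = 1.0 if random.random() < mu[i] else 0.0
    T[i] += 1
    X[i] += (avail - X[i]) / T[i]

est = sorted(range(C), key=lambda k: X[k], reverse=True)[rank - 1]
true = sorted(range(C), key=lambda k: mu[k], reverse=True)[rank - 1]
print(f"estimated rank-{rank} channel: {est}, true rank-{rank} channel: {true}")
```

Since every user explores all channels but exploits only its own rank, no collision feedback is needed, matching the slide above.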

Regret Under Pre-allocation

Theorem (Regret Under ρPRE Policy)

The no. of slots in which user j accesses a channel i ≠ w∗j other than its pre-allocated channel under ρPRE satisfies

E[Ti,j(n)] ≤ (β/C) log n + δ, ∀i = 1, . . . , C, i ≠ w∗j,

when β > max(20, 4/∆min²), where ∆min := min_{i ∈ U-worst} ∆(U∗, i) is the minimum separation between a U-worst channel and the U-th best channel.

Regret R(n) = O(log n) under ρPRE

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 21 / 32

Distributed Learning and Randomized Allocation ρRAND

User adaptively chooses rank wj based on feedback for successful tx.

If collision in previous slot, draw a new wj uniformly from 1 to U

If no collision, retain the current wj

Select channel with the wj-th highest entry of the g-statistic:

gj(i; n) := X̄i,j(Ti,j(n)) + √( 2 log n / Ti,j(n) )

Upper Bound on Regret

R(n) ≤ µ(1∗) ( ∑_{j=1}^{U} ∑_{i ∈ U-worst} E[Ti,j(n)] + E[M(n)] )

∑_{i ∈ U-worst} Ti,j(n): time spent in U-worst channels by user j

M(n): no. of collisions in the U-best channels

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 23 / 32
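
A self-play sketch of ρRAND along the lines above, assuming perfect sensing and collision feedback as in the system model; the parameters and variable names are illustrative.

```python
import math, random

mu = [0.1 * k for k in range(1, 10)]                 # true availabilities (unknown)
C, U, n_slots = len(mu), 4, 20000

# one initial sensing sweep per user so that every count T[j][i] is nonzero
T = [[1] * C for _ in range(U)]
X = [[1.0 if random.random() < m else 0.0 for m in mu] for _ in range(U)]
rank = [random.randint(1, U) for _ in range(U)]      # current rank w_j of each user
successes = 0

for t in range(C + 1, C + n_slots + 1):
    picks = []
    for j in range(U):
        g = [X[j][i] + math.sqrt(2 * math.log(t) / T[j][i]) for i in range(C)]
        order = sorted(range(C), key=lambda i: g[i], reverse=True)
        picks.append(order[rank[j] - 1])             # channel with the w_j-th highest g-statistic
    free = [random.random() < mu[i] for i in range(C)]   # channels free of primary users this slot
    for j, i in enumerate(picks):
        T[j][i] += 1
        X[j][i] += ((1.0 if free[i] else 0.0) - X[j][i]) / T[j][i]   # sensing update
        if picks.count(i) > 1:
            rank[j] = random.randint(1, U)           # collision: redraw rank uniformly from {1, ..., U}
        elif free[i]:
            successes += 1                           # sole user on a free channel: successful tx.

print("throughput:", successes, "ideal:", int(n_slots * sum(sorted(mu)[-U:])))
```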

Distributed Learning and Randomized Allocation ρRAND

Theorem

Under the ρRAND policy, E[∑_{i ∈ U-worst} Ti,j(n)] and E[M(n)] are O(log n), and hence the regret is O(log n), where n is the number of access slots.

Proof Steps for E[∑_{i ∈ U-worst} Ti,j(n)]

Bound time spent in U-worst channels by the decay of the exploration term

Chernoff-Hoeffding bounds for concentration of the sample-mean channel availability

Techniques similar to Auer et al.

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 24 / 32

Idea of Proof for Regret Under ρRAND

Proof for E[M(n)]: no. of collisions in U -best channels

Bound collisions under perfect knowledge of µ as Π(U)

Relate Π(U) with E[M(n)] under learning of µ

Analysis of Π(U)

Markov chain with state space of all possible user configurations, where the absorbing state is the orthogonal configuration

Π(U): mean time to absorption under uniform randomization of colliding users

Π(U) < ∞ since the Markov chain has a finite state space

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 25 / 32
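
The finiteness of Π(U) can also be checked numerically. A small sketch that estimates the mean time to absorption under uniform rank redrawing, under the simplifying assumption that every user already knows the top-U channels:

```python
import random

def slots_to_orthogonalize(U, trials=10000):
    """Estimate the mean number of slots until U users, each redrawing its rank
    uniformly from {1, ..., U} after a collision, reach distinct ranks (absorption)."""
    total = 0
    for _ in range(trials):
        ranks = [random.randint(1, U) for _ in range(U)]   # arbitrary starting configuration
        slots = 0
        while len(set(ranks)) < U:                         # some rank is shared: collisions occur
            slots += 1
            counts = {r: ranks.count(r) for r in ranks}
            ranks = [random.randint(1, U) if counts[r] > 1 else r for r in ranks]
        total += slots
    return total / trials

print("U = 4: mean slots to the orthogonal configuration ~", slots_to_orthogonalize(4))
```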

Idea of Proof for Regret Under ρRAND (Contd.)

Bound on E[M(n)] under learning

Good state: all users estimate the order of the top-U channels correctly; otherwise bad state

Bound separately collisions under good and bad states

Under a run of good states: the mean no. of collisions is Π(U)

Run of bad states is O(log n)

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 26 / 32

Learning with Unknown No. of Users

No. of cognitive users U fixed but unknown to policy

Horizon length n known to users

Regret bounds as n→∞

ρEST: Joint learning of channel order and no. of users

Maintain an estimate Û of the no. of users U

Execute ρRAND(n; Û) based on the g-statistic, assuming Û users

Update Û based on feedback (no. of collisions)

Update Rule for Û under ρEST

Fixed threshold functions ξ(n; k) for n = 1, 2, . . . and k = 1, . . . , C.

Slow start: Initialize Û ← 1.

If the no. of collisions in the top-Û channels exceeds the threshold ξ(n; Û), then Û ← Û + 1

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 28 / 32
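
A sketch of just the Û update rule, assuming collision feedback is supplied from running ρRAND(n; Û); the threshold ξ(n; k) = k (log n)² is one illustrative choice with ξ(n; k)/log n → ∞, and resetting the collision count when Û changes is an assumption made here for simplicity.

```python
import math

def threshold(n, k):
    """Example threshold xi(n; k) = k (log n)^2, which grows faster than log n."""
    return k * math.log(n) ** 2

U_hat = 1                      # slow start: assume a single cognitive user
collisions_in_top = 0          # collisions observed in the current top-U_hat channels

def on_slot(n, collided_in_top_channel):
    """Update U_hat after slot n, given whether this user collided in a top-U_hat channel."""
    global U_hat, collisions_in_top
    if collided_in_top_channel:
        collisions_in_top += 1
    if collisions_in_top > threshold(n, U_hat):
        U_hat += 1             # too many collisions for U_hat users: increment the estimate
        collisions_in_top = 0  # assumption: restart the count for the new estimate

# illustrative feed of collision indicators (a real run would take these from rho-RAND)
for n in range(1, 2001):
    on_slot(n, collided_in_top_channel=(n % 3 == 0))
print("estimated no. of users after the illustrative feed:", U_hat)
```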

Learning with Unknown No. of Users (Contd.)

Theorem

For any class of threshold functions ξ(n; k) satisfying

lim_{n→∞} ξ(n; k)/log n = ∞, ∀k > 1,

the regret under the ρEST policy satisfies

lim sup_{n→∞} R(n; µ, U, ρEST)/ξ∗(n; U) < ∞,

where ξ∗(n; U) := max_{k=1,...,U} ξ(n; k).

Regret is asymptotically (slightly more than) logarithmic under ρEST

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 29 / 32

Proof Steps

Overestimation: Probability that Û ≤ U

Conditional regret: Regret under Û ≤ U

Overestimation Error

Number of collisions experienced when Û = U is Θ(log n)

Applying the threshold ξ(n) = ω(log n) ensures that Û is not incremented

Conditional Regret

Time spent in U -worst channels is O(log n)

Number of collisions is O(ξ∗)

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 30 / 32

Lower Bound on Regret

Uniformly good policy ρ

A policy that enables users to ultimately settle down in the orthogonal best channels under any channel availabilities µ: user j spends most of the time in the U-best channels, and the time spent in a channel i ∈ U-worst satisfies

Eµ[Ti,j(n)] = o(n^α), ∀α > 0, µ ∈ (0, 1)^C, i ∈ U-worst.

Satisfied by ρPRE and ρRAND policies

Theorem (Lower Bound for Uniformly Good Policy (Liu & Zhao))

The sum regret satisfies

lim inf_{n→∞} R(n; µ, U, ρ)/log n ≥ ∑_{i ∈ U-worst} ∑_{j=1}^{U} ∆(U∗, i)/D(µi, µj∗).

Order optimal regret under ρPRE and ρRAND policies

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 32 / 32
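
For Bernoulli availabilities, the constant in this distributed bound can be evaluated for the simulation setup; a sketch reusing the binary KL divergence from the single-user bound:

```python
import math

def kl_bernoulli(p, q):
    """Binary KL divergence D(p || q) between Bernoulli availability distributions."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu = [0.1 * k for k in range(1, 10)]            # C = 9 channels
U = 4
top = sorted(mu, reverse=True)[:U]              # availabilities of the U-best channels mu(1*), ..., mu(U*)
worst = [m for m in mu if m not in top]         # U-worst channels
mu_U = top[-1]                                  # availability of the U-th best channel, mu(U*)

# sum over i in U-worst and j = 1..U of Delta(U*, i) / D(mu_i, mu_j*)
constant = sum((mu_U - mi) / kl_bernoulli(mi, mj) for mi in worst for mj in top)
print(f"distributed lower bound: R(n) >~ {constant:.2f} log n")
```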

Simulation Results

[Figure (left): Normalized regret R(n)/log n vs. no. of slots n, for U = 4 users and C = 9 channels. Curves: Random Allocation, Central Allocation, Distributed Lower Bnd, Centralized Lower Bnd.]

[Figure (right): Normalized regret R(n)/log n vs. no. of users U, for C = 9 channels and n = 2500 slots. Same four curves.]

Probability of Availability µ = [0.1, 0.2, . . . , 0.9].

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 34 / 32

Simulation Results

[Figure (left): No. of runs with top rank vs. user, for U = 4, C = 9, n = 2500 slots, under ρRAND.]

[Figure (right): Regret R(n) vs. no. of slots n for β = 1000, 1200, 2000, with U = 4 users and C = 9 channels, under ρPRE.]

Probability of Availability µ = [0.1, 0.2, . . . , 0.9].

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 35 / 32

Conclusion

Summary

Considered maximizing total throughput of cognitive users under unknown channel availabilities and no coordination

Proposed two algorithms which achieve order optimality
  - ρPRE policy works under pre-allocated ranks
  - ρRAND policy does not require prior information

Proposed ρEST policy when no. of users is unknown
  - ρEST has asymptotically (slightly more than) logarithmic regret

A. Anandkumar, N. Michael, and A.K. Tang, “Opportunistic Spectrum Access with Multiple Users: Learning under Competition,” in Proc. of IEEE INFOCOM, (San Diego, USA), Mar. 2010.

A. Anandkumar, N. Michael, A.K. Tang, and A. Swami, “Distributed Learning and Allocation of Cognitive Users with Logarithmic Regret,” available on website.

Anima Anandkumar (UCI) Learning and MAC 10/08/2010 37 / 32