packet scheduling/arbitration in virtual output queues and others


Page 1: Packet Scheduling/Arbitration in Virtual Output Queues and Others

CSIT560 by M. Hamdi

Packet Scheduling/Arbitration in Virtual Output Queues and Others

Page 2: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Key Characteristics in Designing Internet Switches and Routers

1. Scalability in terms of line rates

2. Scalability in terms of number of interfaces (port numbers)

Page 3: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Switch/Router Architecture Comparison

http://www.lightreading.com/document.asp?doc_id=47959

Page 4: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Head-of-Line Blocking

[Figure: head-of-line blocking; two packets are marked "Blocked!"]


Page 7: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Crossbar Switches: Virtual Output Queues

• Virtual Output Queues:
– At each input port, there are N queues, each associated with an output port
– Only one packet can go from an input port at a time
– Only one packet can be received by an output port at a time

• It retains the scalability of FIFO input-queued switches (no memory bandwidth problem)

• It eliminates the HoL problem of FIFO input queues
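
The queue structure described above can be sketched in a few lines of Python. This is only an illustration: the class name `VOQInputPort` and its methods are hypothetical and do not come from the slides.

```python
from collections import deque


class VOQInputPort:
    """One input port holding N virtual output queues (hypothetical helper)."""

    def __init__(self, num_outputs):
        # One FIFO per output port, so a cell bound for a busy output
        # never sits in front of a cell bound for an idle output.
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output):
        self.voq[output].append(cell)

    def requests(self):
        # Outputs for which this input has at least one queued cell.
        return [j for j, queue in enumerate(self.voq) if queue]

    def dequeue(self, output):
        # Called when the scheduler matches this input to `output`.
        return self.voq[output].popleft()


port = VOQInputPort(num_outputs=4)
port.enqueue("cell-a", output=2)
port.enqueue("cell-b", output=0)
port.enqueue("cell-c", output=2)
pending = port.requests()
```

With a single FIFO, "cell-b" would wait behind "cell-a" even if output 0 were idle; with one queue per output, both outputs appear in `pending` and either can be served.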

Page 8: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Virtual Output Queues

Page 9: Packet Scheduling/Arbitration in Virtual Output Queues and Others

VOQs: How Packets Move

[Figure labels: VOQs, Scheduler]

Page 10: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Crossbar Scheduler in VOQ Architecture

[Figure annotations: Scheduler (can be quite complex!); memory b/w = 2R]

Page 11: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Question: do more lanes help?

• Answer: it depends on the scheduling

[Figure panels: Head of Line Blocking; VOQs with Bad Scheduling; Good Scheduling? Ayalon: depends on traffic matrix…]

Page 12: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Crossbar Scheduler in VOQ Architecture

Which packets can I send during each configuration of the crossbar?

Page 13: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Switch core architecture

[Figure labels: Port #1 … Port #256; Port Processor; optics; LCS Protocol; Crossbar; Scheduler (like the processor of a computer); Request; Grant/Credit; Cell Data]

Page 14: Packet Scheduling/Arbitration in Virtual Output Queues and Others

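Basic Switch Model

[Figure: N×N switch model. Labels: arrivals A1(n) … AN(n), split per output into A11(n) … A1N(n) and AN1(n) … ANN(n); queue occupancies L11(n) … LNN(n); crossbar configuration S(n); departures D1(n) … DN(n)]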

Page 15: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Some definitions

1. Traffic matrix: $\Lambda = [\lambda_{ij}]$, where $\lambda_{ij} = E[A_{ij}(n)]$.
If $\sum_i \lambda_{ij} < 1$ and $\sum_j \lambda_{ij} < 1$, we say the traffic is "admissible".

2. Service matrix: $S(n) = [s_{ij}(n)]$, where $s_{ij} \in \{0, 1\}$ and $S$ is a permutation matrix.

3. Queue occupancies: $L_{11}(n), \ldots, L_{NN}(n)$
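
As a small illustration of the admissibility condition, the sketch below assumes the usual reading that no input (row) and no output (column) of the traffic matrix is loaded to 1 or more. The function name `is_admissible` and the example matrices are hypothetical.

```python
def is_admissible(rates):
    """rates[i][j] is the mean arrival rate from input i to output j."""
    n = len(rates)
    # No input may receive cells at a total rate of 1 or more per slot.
    rows_ok = all(sum(rates[i]) < 1 for i in range(n))
    # No output may be offered cells at a total rate of 1 or more per slot.
    cols_ok = all(sum(rates[i][j] for i in range(n)) < 1 for j in range(n))
    return rows_ok and cols_ok


uniform = [[0.3, 0.3, 0.3] for _ in range(3)]   # every row and column sums to 0.9
overloaded = [[0.6, 0.0], [0.6, 0.0]]           # output 0 is offered 1.2

uniform_ok = is_admissible(uniform)
overloaded_ok = is_admissible(overloaded)
```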

Page 16: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Some possible performance goals

1. Work conservation

2. "100% throughput"

3. $L_{ij}(n) \le C$ for all $n$

4. Delay is bounded

5. Other metrics...?

When traffic is admissible

Page 17: Packet Scheduling/Arbitration in Virtual Output Queues and Others

VOQ Switch Scheduling

[Figure: bipartite graph with left-hand nodes A–F and right-hand nodes 1–6]

• The VOQ switch scheduling can be represented by a bipartite graph
– The left-hand side nodes of the bipartite graph are the input ports
– The right-hand side nodes of the bipartite graph are the output ports
– The edges between the nodes are requests for packet transmission between input ports and output ports.
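
A minimal sketch of how the request graph could be derived from the VOQ occupancies: an edge (i, j) exists whenever input i holds at least one cell for output j. The function name `request_graph` and the occupancy matrix are illustrative assumptions.

```python
def request_graph(occupancy):
    """occupancy[i][j] = number of cells queued at input i for output j.

    Returns the edges (input, output) of the bipartite request graph.
    """
    return [
        (i, j)
        for i, row in enumerate(occupancy)
        for j, cells in enumerate(row)
        if cells > 0
    ]


occupancy = [
    [2, 0, 1],  # input 0 has cells for outputs 0 and 2
    [0, 0, 0],  # input 1 is empty and sends no request
    [0, 3, 0],  # input 2 has cells for output 1
]
edges = request_graph(occupancy)
```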

Page 18: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Maximum size bipartite match

• Intuition: maximizes instantaneous throughput

[Figure: "Request" graph (edges where L11(n) > 0 … LN1(n) > 0) and the resulting bipartite match: the maximum size match]

Page 19: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Network flows and bipartite matching

Finding a maximum size bipartite matching is equivalent to solving a network flow problem with capacities and flows of size "1".

[Figure: bipartite graph with nodes A–F and 1–6, extended with a source s and a sink t]

Page 20: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Network Flows

[Figure: flow network with source s, sink t and intermediate nodes a, b, c, d; edge capacities are 10 or 1]

• Let G=[V,E] be a directed graph with capacity cap(v,w) on edge [v,w].

• A flow is an (integer) function, f, that is chosen for each edge so that f(v,w) <= cap(v,w).

• We wish to maximize the flow allocation.

Page 21: Packet Scheduling/Arbitration in Virtual Output Queues and Others

A maximum network flow example

By inspection

[Figure: the same network (source s, sink t, nodes a, b, c, d; capacities 10 or 1)]

Step 1:

[Figure: edges labelled "capacity, flow"; 10 units are sent along one path from s to t]

Flow is of size 10

Page 22: Packet Scheduling/Arbitration in Virtual Output Queues and Others

A maximum network flow example

Step 2:

[Figure: edges labelled "capacity, flow"; one more unit is sent along a second path]

Flow is of size 10+1 = 11

Maximum flow:

[Figure: edges labelled "capacity, flow"; part of the earlier flow is rerouted]

Flow is of size 10+2 = 12

Not obvious

Page 23: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Ford-Fulkerson method of augmenting paths

1. Set f(v,w) = -f(w,v) on all edges.

2. Define a Residual Graph, R, in which res(v,w) = cap(v,w) – f(v,w)

3. Find paths from s to t for which there is positive residue.

4. Increase the flow along the paths to augment them by the minimum residue along the path.

5. Keep augmenting paths until there are no more to augment.
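
The steps above can be sketched as follows. This is a hedged illustration rather than the slides' own code: it uses breadth-first search to find each augmenting path (the shortest-path variant, usually called Edmonds-Karp), and the network `caps`, including its edge directions, is a hypothetical example.

```python
from collections import deque


def max_flow(cap, s, t):
    """Augmenting-path maximum flow. cap maps (v, w) -> capacity."""
    # Step 2: residual graph, res(v, w) = cap(v, w) - f(v, w), with f = 0 at first.
    res = dict(cap)
    for (v, w) in cap:
        res.setdefault((w, v), 0)  # reverse edge, so flow can later be undone
    adj = {}
    for (v, w) in res:
        adj.setdefault(v, []).append(w)

    total = 0
    while True:
        # Step 3: breadth-first search for a path with positive residue.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            v = queue.popleft()
            for w in adj.get(v, []):
                if w not in parent and res[(v, w)] > 0:
                    parent[w] = v
                    queue.append(w)
        if t not in parent:
            return total  # Step 5: no augmenting path is left.

        # Step 4: augment by the minimum residue along the path.
        path = []
        w = t
        while parent[w] is not None:
            path.append((parent[w], w))
            w = parent[w]
        bottleneck = min(res[edge] for edge in path)
        for (v, w) in path:
            res[(v, w)] -= bottleneck
            res[(w, v)] += bottleneck  # f(w, v) = -f(v, w)
        total += bottleneck


# Hypothetical network; capacities and directions are assumptions.
caps = {
    ("s", "a"): 16, ("s", "c"): 13, ("a", "c"): 10, ("c", "a"): 4,
    ("a", "b"): 12, ("b", "c"): 9, ("c", "d"): 11, ("d", "b"): 7,
    ("b", "t"): 20, ("d", "t"): 4,
}
flow_value = max_flow(caps, "s", "t")
```

In this assumed network the cut formed by edges a→b and c→d has capacity 12 + 11 = 23, which bounds the flow from above, and the search reaches that bound.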

Page 24: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Example of Residual Graph

[Figure: flow network (nodes s, a, b, c, d, t) with edges labelled "capacity, flow"]

Flow is of size 10

res(v,w) = cap(v,w) – f(v,w)

[Figure: Residual Graph, R, with an augmenting path highlighted]


Page 26: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Example of Residual Graph

Step 2:

[Figure: flow network with edges labelled "capacity, flow"]

Flow is of size 10+1 = 11

[Figure: Residual Graph (residual capacities of 10, 9 and 1), with an augmenting path highlighted]

Page 27: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Example of Residual Graph

Step 3:

[Figure: flow network with edges labelled "capacity, flow"]

Flow is of size 10+2 = 12

[Figure: Residual Graph (residual capacities of 10, 8, 2 and 1)]

Page 28: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Another Example: Ford-Fulkerson method

[Figure: flow network G (nodes s, a, b, c, d, t; edge capacities 16, 13, 10, 4, 9, 7, 12, 20, 4, 11) beside its residual graph Gf. Starting from f = 0, find augmenting path p; after augmenting, edges carry 4/13, 4/11 and 4/4]

f=4

Page 29: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Another Example: Ford-Fulkerson method

[Figure: G and Gf at f = 4. Find augmenting path p; after augmenting, edges carry 12/16, 12/12 and 12/20]

f=4+12

Page 30: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Another Example: Ford-Fulkerson method

[Figure: G and Gf at f = 16. Find augmenting path p; after augmenting, edges carry 11/13, 11/11, 7/7 and 19/20]

f=16+7

Page 31: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Another Example: Ford-Fulkerson method

[Figure: G and Gf at f = 23, with edges carrying 12/16, 11/13, 12/12, 11/11, 7/7, 19/20 and 4/4]

No more augmenting path

Maximum Flow is 23

Page 32: Packet Scheduling/Arbitration in Virtual Output Queues and Others

An example for Flow: Obvious solution

[Figure: input graph G (source S, sink T, edge capacities of 10 and 9), shown with its flow graph Gf and residual graph Gr as 10 units are pushed along one path]

Total flow = 10, Sub-optimal solution!

Page 33: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Flow algorithm – Optimal version

[Figure: the same input graph G (edge capacities of 10 and 9), with flow graph Gf and residual graph Gr shown after each augmentation step]

Total flow = 10 + 9 = 19 units!

Page 34: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Complexity of network flow problems

• In general, it is possible to find a solution by considering at most V·E paths, by picking shortest augmenting path first.

• There are many variations, such as picking most augmenting path first.

• The complexity of the algorithm is less when the graph is bipartite

• There are techniques other than the Ford-Fulkerson method.

Page 35: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Ford-Fulkerson Algorithm – 1

Network flows and bipartite matching

Finding a maximum size bipartite matching is equivalent to solving a network flow problem with capacities and flows of size "1".

[Figure: bipartite graph between nodes a–f and 1–6, with a source and a sink]

Page 36: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Ford-Fulkerson Algorithm – 2

[Figure: same graph] Increasing the flow by 1.

Page 37: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Ford-Fulkerson Algorithm – 3

[Figure: same graph] Increasing the flow by 1.

Page 38: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Ford-Fulkerson Algorithm – 4

[Figure: same graph] Increasing the flow by 1.

Page 39: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Ford-Fulkerson Algorithm – 5

[Figure: same graph] Increasing the flow by 1.

Page 40: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Ford-Fulkerson Algorithm – 6

[Figure: same graph] Increasing the flow by 1.

Page 41: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Ford-Fulkerson Algorithm – 7

[Figure: same graph] Augmenting flow along the augmenting path.

Page 42: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Ford-Fulkerson Algorithm – 8

[Figure: same graph] Maximum flow found! Thus maximum matching found.
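
Because every capacity is 1, the flow computation can be specialized to the matching itself. The sketch below is one common way to do this (a recursive augmenting-path search); the function name `maximum_matching` and the request lists are hypothetical.

```python
def maximum_matching(requests, num_outputs):
    """requests[i] lists the outputs that input i has cells for.

    Returns (size, match) where match[j] is the input matched to
    output j, or None if output j is left unmatched.
    """
    match = [None] * num_outputs

    def try_augment(i, seen):
        for j in requests[i]:
            if j in seen:
                continue
            seen.add(j)
            # Output j is free, or its current input can be re-routed
            # to another output (this is the augmenting path).
            if match[j] is None or try_augment(match[j], seen):
                match[j] = i
                return True
        return False

    size = sum(try_augment(i, set()) for i in range(len(requests)))
    return size, match


# Input 0 wants outputs 0 and 1, input 1 only output 0, input 2 outputs 1 and 2.
size, match = maximum_matching([[0, 1], [0], [1, 2]], num_outputs=3)
```

Input 0 first takes output 0, but is moved to output 1 when input 1 arrives, which is exactly an edge being removed and replaced along an augmenting path.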

Page 43: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Complexity of Maximum Matchings

• Maximum Size/Cardinality Matchings:
– Algorithm by Dinic: O(N^{5/2})

• Maximum Weight Matchings
– Algorithm by Kuhn: O(N^3 log N)

• ftp://dimacs.rutgers.edu/pub/netflow/matching/ (contains code for maximum size/weighting algorithms)

• In general:
– Hard to implement in hardware
– Slooooow.

Page 44: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Maximum size bipartite match

• Intuition: maximizes instantaneous throughput

• for uniform traffic.

[Figure: "Request" graph (edges where L11(n) > 0 … LN1(n) > 0) and the resulting bipartite match: the maximum size match. Annotation: $E[L_{ij}(n)]$]

Page 45: Packet Scheduling/Arbitration in Virtual Output Queues and Others

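Why doesn't maximizing instantaneous throughput give 100% throughput for non-uniform traffic?

[Figure: switch with arrival rates $\lambda_{11} = \lambda_{12} = \lambda_{21} = \lambda_{32} = 1/2 - \delta$]

Three possible matches, $S(n)$.

Assume that at time $n$, $L_{11}(n) > 0$ and $L_{12}(n) > 0$, and that $Q_{21}(n)$ and $Q_{32}(n)$ both have arrivals (w.p. $(1/2 - \delta)^2$). Input 1 is serviced w.p. $2/3$.

The total rate at which input 1 is served is at most

$$\frac{2}{3}\left(\frac{1}{2} - \delta\right)^2 + 1 \cdot \left[1 - \left(\frac{1}{2} - \delta\right)^2\right] = 1 - \frac{1}{3}\left(\frac{1}{2} - \delta\right)^2.$$

But $\lambda_{11} + \lambda_{12} = 1 - 2\delta > 1 - \frac{1}{3}\left(\frac{1}{2} - \delta\right)^2$ if $\delta < 0.0358$, and so the switch is not stable (throughput < 100%).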

Page 46: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Maximum weight matching

[Figure: switch model (arrivals A1(n) … AN(n), queue occupancies L11(n) … LNN(n), departures D1(n) … DN(n)); the "Request" graph has edge weights L11(n) … LN1(n), and the bipartite match chosen is the maximum weight match S*(n)]

$$S^*(n) = \arg\max_{S(n)} \left( L^T(n)\, S(n) \right)$$

• Weight could be length of queue or age of packet

• Achieves 100% throughput under all admissible traffic patterns
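
To make the arg max concrete, the sketch below enumerates every permutation and keeps the one with the largest total weight. It is a brute-force illustration that is only usable for very small N and is not the polynomial-time algorithm cited elsewhere in these slides; the function name and the occupancy matrix are hypothetical.

```python
from itertools import permutations


def maximum_weight_match(L):
    """L[i][j] is the weight (e.g. queue length) of VOQ (i, j).

    Returns (perm, weight), where perm[i] is the output assigned to input i.
    """
    n = len(L)

    def weight(perm):
        return sum(L[i][perm[i]] for i in range(n))

    # Each permutation corresponds to one crossbar configuration S(n).
    best = max(permutations(range(n)), key=weight)
    return best, weight(best)


# Queue lengths: VOQ (0, 0) is long, VOQs (0, 1) and (1, 0) hold one cell each.
L = [[10, 1],
     [1, 0]]
best, best_weight = maximum_weight_match(L)
```

Here a maximum size match would serve the two short queues (0, 1) and (1, 0), while the maximum weight match serves the long queue (0, 0) instead.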

Page 47: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Packet Scheduling/Arbitration in Virtual Output Queues:

Maximal Matching Algorithms

Page 48: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Maximum Matching in VOQ Architecture

[Figure: 4×4 bipartite request graphs (inputs 1–4, outputs 1–4) with edge weights 8, 6, 4, 2, 1, 3, 1. Panels: maximum size matching; maximum weight matching]

Page 49: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Complexity of Maximum Matchings

• Maximum Size/Cardinality Matchings:
– Algorithm by Dinic: O(N^{5/2})

• Maximum Weight Matchings
– Algorithm by Kuhn: O(N^3 log N)

• In general:
– Hard to implement in hardware
– Slooooow.

Page 50: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Maximal Matching

• A maximal matching is a matching in which each edge is added one at a time, and is not later removed from the matching.

• i.e., no augmenting paths are allowed (they remove edges added earlier), as when matching by inspection.

• No input and output are left unnecessarily idle.

Page 51: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Example of Maximal Size Matching

[Figure: two copies of a bipartite graph on nodes A–F and 1–6. Panels: Maximal Matching; Maximum Matching]

Page 52: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Comments on Maximal Matchings

• In general, maximal matching is much simpler to implement, and has a much faster running time.

• A maximal size matching is at least half the size of a maximum size matching.

• A maximal weight matching is defined in the obvious way.

• A maximal weight matching is at least half the weight of a maximum weight matching.
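
A greedy pass over the request edges is perhaps the simplest way to obtain a maximal size matching: each edge is added if both endpoints are still free and is never removed afterwards. The sketch below is illustrative; the function name and the edge list are assumptions.

```python
def greedy_maximal_matching(edges):
    """edges is a list of (input, output) requests, scanned in order."""
    used_inputs, used_outputs, matching = set(), set(), []
    for i, j in edges:
        # Add the edge only if neither endpoint is matched; never undo it.
        if i not in used_inputs and j not in used_outputs:
            matching.append((i, j))
            used_inputs.add(i)
            used_outputs.add(j)
    return matching


# Input A requests outputs 1 and 2; input B requests only output 1.
edges = [("A", 1), ("A", 2), ("B", 1)]
matching = greedy_maximal_matching(edges)
```

The result contains one edge, (A, 1), and nothing more can be added to it, yet the maximum size matching {(A, 2), (B, 1)} has two edges. This is the factor-of-two gap mentioned above.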

Page 53: Packet Scheduling/Arbitration in Virtual Output Queues and Others


PIM Maximal Size Matching Algorithm: Performance and Properties

• It is among the very first practical schedulers proposed for VOQ architectures (used by DEC).

• It is based on having arbiters at the inputs and outputs

• It iterates the following steps until no more requests can be accepted (or for a given number of iterations):

1. Request: Each unmatched input sends a request to every output for which it has a queued cell

2. Grant (outputs): If an unmatched output receives any request, it grants one by randomly selecting a request uniformly over all requests.

3. Accept (inputs): If an unmatched input receives a grant, it accepts one by selecting an output randomly among those granted to this input.
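
The three steps can be sketched in software as below. This is a sequential simulation under stated assumptions, not a model of the hardware arbiters: the function name `pim`, the `rng` parameter and the occupancy matrix are hypothetical.

```python
import random


def pim(occupancy, iterations, rng=random):
    """occupancy[i][j] > 0 means input i has a queued cell for output j.

    Returns a dict mapping each matched input to its output.
    """
    n = len(occupancy)
    match_in, match_out = {}, {}
    for _ in range(iterations):
        # 1. Request: unmatched inputs request outputs they have cells for.
        #    Requests to already matched outputs are skipped here, since
        #    only unmatched outputs would grant them anyway.
        requests = {}
        for i in range(n):
            if i in match_in:
                continue
            for j in range(n):
                if j not in match_out and occupancy[i][j] > 0:
                    requests.setdefault(j, []).append(i)
        if not requests:
            break  # nothing left to match
        # 2. Grant: each output picks one request uniformly at random.
        grants = {}
        for j, inputs in requests.items():
            grants.setdefault(rng.choice(inputs), []).append(j)
        # 3. Accept: each input picks one of its grants at random.
        for i, outputs in grants.items():
            j = rng.choice(outputs)
            match_in[i] = j
            match_out[j] = i
    return match_in


occ = [[1, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 1, 1],
       [0, 0, 0, 1]]
result = pim(occ, iterations=4, rng=random.Random(1))
```

Every iteration that still has requests adds at least one match, so running N iterations is enough to reach a maximal matching in this sketch, although far fewer are typically needed.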

Page 54: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Implementation of the parallel maximal matching algorithms

[Figure labels: State of Input Queues (N^2 bits); Grant Arbiters 1, 2, … N; Request Arbiters 1, 2, … N; Decision Register]

Page 55: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Implementation of the parallel maximal matching algorithms (another similar way)

[Figure: three identical stages, each with a Request Buffer, a Grant Arbiter and an Accept Arbiter, producing a New Request and a Decision]

Page 56: Packet Scheduling/Arbitration in Virtual Output Queues and Others

PIM Maximal Size Matching Algorithm: Performance and Properties

PIM: 1st Iteration

[Figure: 4×4 switch (inputs 1–4, outputs 1–4). Step 1: Request. Step 2: Grant (random selection). Step 3: Accept (random selection)]

Page 57: Packet Scheduling/Arbitration in Virtual Output Queues and Others

PIM Maximal Size Matching Algorithm: Performance and Properties

PIM: 2nd Iteration

[Figure: 4×4 switch (inputs 1–4, outputs 1–4). Step 1: Request. Step 2: Grant. Step 3: Accept]

Page 58: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Traffic Types to evaluate Algorithms

[Figure: traffic matrices for three patterns: uniform traffic, unbalanced traffic and hotspot traffic]

Page 59: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Parallel Iterative Matching

PIM with a single iteration

Page 60: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Parallel Iterative Matching

PIM with 4 iterations

Page 61: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Parallel Iterative Matching: Analytical Results

Number of iterations to converge:

$$E[U_i] \le \frac{N^2}{4^i}, \qquad E[C] \le \log N$$

where
C = # of iterations required to resolve connections
N = # of ports
U_i = # of unresolved connections after iteration i

Page 62: Packet Scheduling/Arbitration in Virtual Output Queues and Others

PIM Maximal Size Matching Algorithm: Performance and Properties

• It is a fair algorithm – servicing inputs

• Can have 100% throughput under uniform traffic

• It converges in an average of log N iterations to a maximal size matching

• It has a very poor performance (63% throughput) with 1 iteration – because of its inability to desynchronize the output pointers

• It is not easy to build random arbiters in hardware

• The best iterative maximal size matching algorithm takes O(N^2 log N) serial or O(log N) parallel time steps.

• If the number of iterations is constant, then it can be implemented in constant time (that is why it is practical) – however the hardware design is not trivial.

Page 63: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Size Matching Algorithm: Performance and Properties

• Round Robin Matching (RRM) is easier to implement than PIM (in terms of designing the I/O arbiters).

• The pointers of the arbiters move in a straightforward way

• It iterates the following steps until no more requests can be accepted (or for a given number of iterations):

• Request. Each input sends a request to every output for which it has a queued cell.

• Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer g_i to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input. If no request is received, the pointer stays unchanged.

Page 64: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Size Matching Algorithm: Performance and Properties

• Accept. If an input receives a grant, it accepts the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The pointer a_i to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the accepted output. If no grant is received, the pointer stays unchanged.
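
A single RRM iteration can be sketched as below. The function name `rrm_iteration`, the list-based pointers and the 2×2 workload are illustrative assumptions; the pointer rules follow the Grant and Accept descriptions above.

```python
def rrm_iteration(requests, grant_ptr, accept_ptr):
    """One Request/Grant/Accept round of RRM.

    requests[i] is the set of outputs input i has cells for.
    grant_ptr[j] and accept_ptr[i] are updated in place.
    """
    n = len(requests)
    grants = {}  # input -> outputs that granted it
    for j in range(n):
        candidates = [i for i in range(n) if j in requests[i]]
        if not candidates:
            continue  # no request: the pointer stays unchanged
        # First requesting input at or after the pointer, in round-robin order.
        chosen = min(candidates, key=lambda i: (i - grant_ptr[j]) % n)
        grants.setdefault(chosen, []).append(j)
        # RRM moves the grant pointer whether or not the grant is accepted.
        grant_ptr[j] = (chosen + 1) % n
    match = {}
    for i, outputs in grants.items():
        j = min(outputs, key=lambda out: (out - accept_ptr[i]) % n)
        match[i] = j
        accept_ptr[i] = (j + 1) % n
    return match


# 2x2 switch in which every input always has cells for every output.
requests = [{0, 1}, {0, 1}]
grant_ptr, accept_ptr = [0, 0], [0, 0]
sizes = [len(rrm_iteration(requests, grant_ptr, accept_ptr)) for _ in range(4)]
```

In this workload both grant pointers move together, so both outputs keep granting the same input and only one cell is transferred per slot.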

Page 65: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Matching Algorithm (1)

[Figure: 4×4 switch (inputs 0–3, outputs 0–3). Step 1: Request]

Page 66: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Matching Algorithm (2)

[Figure: 4×4 switch with round-robin grant pointers (positions 0–3). Step 2: Grant]

Page 67: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Matching Algorithm (2)

[Figure: Step 2: Grant, continued]

Page 68: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Matching Algorithm (2)

[Figure: Step 2: Grant, continued]

Page 69: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Matching Algorithm (2)

[Figure: Step 2: Grant, continued]

Page 70: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Matching Algorithm (3)

[Figure: 4×4 switch with round-robin grant and accept pointers (positions 0–3). Step 3: Accept]

Page 71: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Matching Algorithm (3)

[Figure: Step 3: Accept, continued]

Page 72: Packet Scheduling/Arbitration in Virtual Output Queues and Others

RRM Maximal Matching Algorithm (3)

[Figure: Step 3: Accept, continued]

Page 73: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Poor performance of RRM Maximal Matching Algorithm

[Figure: 2×2 switch (ports 0 and 1) shown over successive time slots]

50% Throughput

Page 74: Packet Scheduling/Arbitration in Virtual Output Queues and Others

iSLIP Maximal Size Matching Algorithm: Performance and Properties

• It is a scheduler used in most VOQ switches (e.g., Cisco).

• It is exactly like the RRM algorithm with the following change:

• Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer g_i to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input if and only if the grant is accepted in the Accept phase.
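
The effect of this change can be seen in a small simulation. The sketch below models a single iSLIP iteration per time slot, with the grant pointer moved only when the grant is accepted; the function name `islip_iteration` and the 2×2 workload are illustrative assumptions.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One Request/Grant/Accept round of iSLIP.

    requests[i] is the set of outputs input i has cells for.
    grant_ptr[j] and accept_ptr[i] are updated in place.
    """
    n = len(requests)
    grants = {}  # input -> outputs that granted it
    for j in range(n):
        candidates = [i for i in range(n) if j in requests[i]]
        if candidates:
            chosen = min(candidates, key=lambda i: (i - grant_ptr[j]) % n)
            grants.setdefault(chosen, []).append(j)
    match = {}
    for i, outputs in grants.items():
        j = min(outputs, key=lambda out: (out - accept_ptr[i]) % n)
        match[i] = j
        accept_ptr[i] = (j + 1) % n
        # Only the output whose grant was accepted moves its pointer.
        grant_ptr[j] = (i + 1) % n
    return match


# 2x2 switch in which every input always has cells for every output.
requests = [{0, 1}, {0, 1}]
grant_ptr, accept_ptr = [0, 0], [0, 0]
sizes = [len(islip_iteration(requests, grant_ptr, accept_ptr)) for _ in range(4)]
```

After the first slot the two grant pointers differ, so the outputs grant different inputs and both cells are transferred in every later slot.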

Page 75: Packet Scheduling/Arbitration in Virtual Output Queues and Others

iSLIP Maximal Size Matching Algorithm

iSlip: 1st Iteration

[Figure: 4×4 switch (inputs 1–4, outputs 1–4) with round-robin pointers (positions 1–4). Step 1: Request. Step 2: Grant. Step 3: Accept. Legend: original pointer, selected one, updated pointer]

Page 76: Packet Scheduling/Arbitration in Virtual Output Queues and Others

iSLIP Maximal Size Matching Algorithm

iSlip: 2nd Iteration

[Figure: 4×4 switch (inputs 1–4, outputs 1–4) with round-robin pointers (positions 1–4). Step 1: Request. Step 2: Grant. Step 3: Accept. Annotation: No change. Legend: original pointer, selected one, updated pointer]

Page 77: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: iSlip

[Figure: 4×4 switch (inputs 0–3, outputs 0–3). Step 1: Request]

Page 78: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: iSlip

[Figure: 4×4 switch with round-robin grant pointers (positions 0–3). Step 2: Grant]

Page 79: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: iSlip

[Figure: Step 2: Grant, continued]

Page 80: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: iSlip

[Figure: 4×4 switch with round-robin grant and accept pointers (positions 0–3). Step 3: Accept]

Page 81: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: iSlip

[Figure: Step 3: Accept, continued]

Page 82: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: iSlip

[Figure: Step 3: Accept, continued]

Page 83: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: iSlip

[Figure: Step 3: Accept, continued]

Page 84: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: iSlip

[Figure: Step 3: Accept, continued]

Page 85: Packet Scheduling/Arbitration in Virtual Output Queues and Others

iSLIP Implementation

[Figure labels: State (N bits per arbiter); Grant arbiters 1, 2, … N; Accept arbiters 1, 2, … N; Decision (log2 N bits per arbiter); Programmable Priority Encoder]

Page 86: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Hardware Design

256-bit Priority Encoder
Layout size: 292 μm × 273 μm
Post-layout simulation delay: 2.7 ns

[Figure: layout of the 256-bit priority encoder]

Page 87: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Hardware Design

[Figure: block diagram with two P.E. blocks, Filter, MUX, two Flipping blocks, Pointer & Mask, and Latch]

256-bit Priority Encoder
Layout size: 1016 μm × 985 μm
Post-layout simulation delay (filter to the latch): 2.3 ns
Post-layout simulation delay (P.E. to the flipping): 4.06 ns

[Figure: layout of the 256-bit grant arbiter]

Page 88: Packet Scheduling/Arbitration in Virtual Output Queues and Others

FIRM Maximal Size Matching Algorithm: Performance and Properties

• It is exactly like iSLIP with a very small, yet significant, modification.

• Grant (outputs): If an unmatched output receives a request, it grants the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request is granted. If the input accepts the grant, the pointer to the highest priority element of the round-robin schedule is incremented to one location beyond the granted input. If the input does not accept, the pointer is set at the granted one.

Page 89: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Simple Iterative Algorithms: FIRM

[Figure: 4×4 switch (inputs 0–3, outputs 0–3) with round-robin pointers (positions 0–3). Step 3: Accept]

Page 90: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Pointer Synchronization

• Why this is good: this small change prevents the output arbiters from moving in lock-step (being synchronized, i.e. pointing to the same input), leading to a dramatic improvement in performance.

• If several outputs grant the same input, no matter how this input chooses, only one match can be made, and the other outputs will be idle.

• To get as many matches as possible, it's better that each output grants a different input.

• Since each output will select the highest priority input if a request is received from this input, it's better to keep the output pointers desynchronized (pointing to different locations).

Page 91: Packet Scheduling/Arbitration in Virtual Output Queues and Others

iSLIP Maximal Matching Algorithm

[Figure: 2×2 switch (ports 0 and 1) shown over successive time slots]

100% Throughput

Page 92: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Pointer Synchronization: Differences between RRM, iSlip & FIRM

[Plot: average number of synchronized output schedulers (0 to 35) versus normalized load (0.1 to 1) for a 32x32 switch under uniform traffic; curves for RRM, iSlip and FIRM]

Page 93: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Differences between RRM, iSlip & FIRM

Input pointer (same for RRM, iSlip and FIRM):
– No grant: unchanged
– Granted: one location beyond the accepted one

Output pointer:
– No request: unchanged (RRM, iSlip and FIRM)
– Grant accepted: one location beyond the granted one (RRM, iSlip and FIRM)
– Grant not accepted:
  RRM: one location beyond the previously granted one
  iSlip: unchanged
  FIRM: the granted one
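
The output-pointer rules in this comparison fit in one small function. The sketch below is illustrative only; the function name `next_grant_pointer` and its argument names are hypothetical.

```python
def next_grant_pointer(algorithm, pointer, granted, accepted, n):
    """Return the new output (grant) pointer after one iteration.

    pointer  -- current pointer position
    granted  -- input that received the grant, or None if no request arrived
    accepted -- True if that input accepted the grant
    n        -- number of ports
    """
    if granted is None:
        return pointer  # no request: unchanged in all three schemes
    if accepted or algorithm == "RRM":
        # RRM always advances; iSlip and FIRM advance on acceptance.
        return (granted + 1) % n
    if algorithm == "iSlip":
        return pointer  # grant not accepted: unchanged
    if algorithm == "FIRM":
        return granted  # grant not accepted: stay on the granted input
    raise ValueError(f"unknown algorithm: {algorithm}")


# Output pointer at 0 grants input 2, and the grant is not accepted.
outcomes = {
    name: next_grant_pointer(name, pointer=0, granted=2, accepted=False, n=4)
    for name in ("RRM", "iSlip", "FIRM")
}
```

Only the "grant not accepted" case separates the three schemes, which is where the difference in pointer desynchronization comes from.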

Page 94: Packet Scheduling/Arbitration in Virtual Output Queues and Others

General remarks

• Since all of these algorithms try to approximate maximum size matching, they can be unstable under non-uniform traffic

• They can achieve 100% throughput under uniform traffic

• Under a large number of iterations, their performance is similar

• They have similar implementation complexity

Page 95: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Input Queueing: Longest Queue First or Oldest Cell First

[Figure: 4×4 request graph (inputs 1–4, outputs 1–4) with edge weights of 10 and 1, and the maximum weight match selected from it]

Weight = { Waiting Time, Queue Length }: 100%

Page 96: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Input Queueing

Why is serving long/old queues better than serving maximum number of queues?

• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.

• When traffic is non-uniform, some queues become longer than others.

• A good algorithm keeps the queue lengths matched, and services a large number of queues.

[Plots: average occupancy versus VOQ #, one for uniform traffic and one for non-uniform traffic]

Page 97: Packet Scheduling/Arbitration in Virtual Output Queues and Others

Maximum/Maximal Weight Matching

• 100% throughput for admissible traffic (uniform or non-uniform)

• Maximum Weight Matching
– OCF (Oldest Cell First): w = cell waiting time
– LQF (Longest Queue First): w = input queue occupancy
– LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port

• Maximal Weight Matching (practical algorithms)
– iOCF
– iLQF
– iLPF (comparators in the critical path of iLQF are removed)

Page 98: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Maximal Weight Matching Algorithms: iLQF

• Request. Each unmatched input sends a request word of width bits to each output for which it has a queued cell, indicating the number of cells that it has queued to that output.

• Grant. If an unmatched output receives any requests, it chooses the largest valued request. Ties are broken randomly.

• Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request. Ties are broken randomly.
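The request, grant and accept steps above can be sketched as follows. This is a software illustration only, assuming the VOQ occupancies are available as a matrix; the function name `ilqf` and its parameters are hypothetical, and real implementations are hardware arbiters.

```python
import random

def ilqf(occupancy, iterations=1, rng=random):
    """Return match[i] = output matched to input i, or None if unmatched."""
    n = len(occupancy)
    match_of_input = [None] * n
    match_of_output = [None] * n
    for _ in range(iterations):
        grants = {}  # input -> list of outputs that granted it
        for j in range(n):
            if match_of_output[j] is not None:
                continue
            # Request: unmatched inputs with queued cells for output j.
            requests = [(occupancy[i][j], i) for i in range(n)
                        if match_of_input[i] is None and occupancy[i][j] > 0]
            if not requests:
                continue
            # Grant: largest valued request, ties broken randomly.
            largest = max(w for w, _ in requests)
            winner = rng.choice([i for w, i in requests if w == largest])
            grants.setdefault(winner, []).append(j)
        if not grants:
            break  # no new matches are possible
        for i, outputs in grants.items():
            # Accept: the grant for the largest valued request, ties random.
            largest = max(occupancy[i][j] for j in outputs)
            j = rng.choice([j for j in outputs if occupancy[i][j] == largest])
            match_of_input[i] = j
            match_of_output[j] = i
    return match_of_input
```

For example, `ilqf([[5, 2, 0], [4, 0, 1], [0, 3, 0]])` matches input 0 to output 0, input 1 to output 2 and input 2 to output 1 in a single iteration.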

Page 99: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Maximal Weight Matching Algorithms: iLQF

• The i-LQF algorithm has the following properties:

• Property 1. Independent of the number of iterations, the longest input queue is always served.

• Property 2. As with i-SLIP, the algorithm converges in at most logN iterations.

• Property 3. For an inadmissible offered load, an input queue may be starved.

Page 100: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Maximal Weight Matching Algorithms: iOCF

• The i-OCF algorithm works in similar fashion to iLQF, and has the following properties:

• Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues (it must be at the head of its queue) is always served.

• Property 2. As with i-LQF, the algorithm converges in at most logN iterations.

• Property 3. No input queue can be starved indefinitely.

• Property 4. It is difficult to keep time stamps on the cells.

Page 101: Packet Scheduling/Arbitration in Virtual Output Queues and Others


iLQF - Implementation

Page 102: Packet Scheduling/Arbitration in Virtual Output Queues and Others


iLQF - Implementation

Complicated hardware

Page 103: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Other research efforts

• Packet-based arbitration

• Exhaustive-based arbitration

• Numerous other efforts

Page 104: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Packet Scheduling/Arbitration in Virtual Output Queues:

Randomized Algorithms and Others

Page 105: Packet Scheduling/Arbitration in Virtual Output Queues and Others


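Input-Queued Packet Switch

[Figure: an N x N crossbar with a scheduler; traffic from input i to output j arrives at rate $\lambda_{i,j}$ (from $\lambda_{1,1}$ to $\lambda_{N,N}$) and is queued in VOQ $X_{i,j}$.]

Admissible traffic: $\sum_i \lambda_{i,j} < 1$ for every output $j$, and $\sum_j \lambda_{i,j} < 1$ for every input $i$.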

Page 106: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Bipartite Graph and Matrix

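[Figure: a bipartite request graph between inputs 1-3 and outputs 1-3, shown next to its 3 x 3 request matrix with rows 0 1 1, 1 1 1 and 0 0 1.]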

Page 107: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Stability of Scheduling

Definition:

Let $X_{i,j}(t)$ be the number of packets queued at input i for output j at time-slot t.

Then an algorithm is stable iff:

$$E\left[\sum_{i,j} X_{i,j}(t)\right] < \infty$$

Page 108: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Motivation

• Networking problems suffer from the “curse of dimensionality”
– algorithmic solutions do not scale well

• Typical causes
– size: large number of users or large number of I/O
– time: very high speeds of operation

• A good deterministic algorithm exists (Max Flow), but …
– it needs state information, and “state” is too big
– it “starts from scratch” in each iteration

Page 109: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Randomization

• Randomized algorithms have frequently been used in situations where the state space (e.g., the N! possible connection patterns between inputs and outputs) is very large

• Randomized algorithms
– are a powerful way of approximating the optimal solution
– it is often possible to randomize deterministic algorithms
– this simplifies the implementation while retaining a (surprisingly) high level of performance

• The main idea is
– to simplify the decision-making process
– by basing decisions upon a small, randomly chosen sample of the state
– rather than upon the complete state

Page 110: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Randomizing Iterative Schemes (e.g., iSLIP)

• Often, we want to perform some operation iteratively

• Example: find the heaviest matching in a switch in every time slot

• Since, in each time slot
– at most one packet can arrive at each input
– and at most one packet can depart from each output
the size of the queues, or the “state” of the switch, doesn’t change by much between successive time slots; so a matching that was heavy at time t will quite likely continue to be heavy at time t+1

• This suggests that
– knowing a heavy matching at time t should help in determining a heavy matching at time t+1
– there is no need to start from scratch in each time slot

Page 111: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Summarizing Randomized Algorithms

• Randomized algorithms can help simplify the implementation
– by reducing the amount of work in each iteration

• If the state of the system doesn’t change by much between iterations, then
– we can reduce the work even further by carrying information between iterations

• The big pay-off is that, even though it is an approximation, the performance of a randomized scheme can be surprisingly good

Page 112: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Randomized Scheduling Algorithms: Example

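• Consider a 3 x 3 input-queued switch
– input traffic is Bernoulli IID with λij = α/3 for all i, j, and α < 1
– this is admissible
– note: there are a total of 6 (= 3!) possible service matrices

$$\Lambda = \begin{pmatrix} \alpha/3 & \alpha/3 & \alpha/3 \\ \alpha/3 & \alpha/3 & \alpha/3 \\ \alpha/3 & \alpha/3 & \alpha/3 \end{pmatrix} = \frac{\alpha}{3} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

$$S \in \left\{ \begin{pmatrix} 1&0&0 \\ 0&1&0 \\ 0&0&1 \end{pmatrix}, \begin{pmatrix} 0&1&0 \\ 1&0&0 \\ 0&0&1 \end{pmatrix}, \begin{pmatrix} 1&0&0 \\ 0&0&1 \\ 0&1&0 \end{pmatrix}, \begin{pmatrix} 0&0&1 \\ 1&0&0 \\ 0&1&0 \end{pmatrix}, \begin{pmatrix} 0&1&0 \\ 0&0&1 \\ 1&0&0 \end{pmatrix}, \begin{pmatrix} 0&0&1 \\ 0&1&0 \\ 1&0&0 \end{pmatrix} \right\}$$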

Page 113: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Random Scheduling Algorithms

• In time slot n, let S(n) be equal to one of the 6 possible matchings, chosen independently and uniformly at random

• Stability of Random
– Consider L11(n), the number of packets in VOQ11
  • arrivals to VOQ11 occur according to A11(n), which is Bernoulli IID
  • input rate = λ11 = α/3
  • this queue gets served whenever the service matrix connects input 1 to output 1
  • there are 2 service matrices that connect input 1 to output 1
  • since Random chooses service matrices u.a.r., input 1 is connected to output 1 for a fraction of time = 2/6 = 1/3, which is the service rate between input 1 and output 1
  • E(L11(n)) < ∞ iff λ11 < 1/3, i.e., α < 1
  • For this uniform traffic, the random algorithm is stable.
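The service-rate argument above can be checked by enumeration. A small sketch, assuming matchings are represented as permutations; the names `random_service_rate` and `is_stable` are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def random_service_rate(n, i, j):
    """Fraction of the n! matchings that connect input i to output j."""
    matchings = list(permutations(range(n)))
    hits = sum(1 for m in matchings if m[i] == j)
    return Fraction(hits, len(matchings))

def is_stable(arrival_rate, service_rate):
    """A single queue is stable when arrivals are slower than service."""
    return arrival_rate < service_rate

# 3 x 3 switch, uniform traffic: lambda_11 = alpha / 3.
mu_11 = random_service_rate(3, 0, 0)       # 2 of the 6 matchings
alpha = Fraction(9, 10)                    # example load, alpha < 1
stable_uniform = is_stable(alpha / 3, mu_11)
```

Using exact fractions avoids rounding issues when the arrival rate is close to the service rate.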

Page 114: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Random Scheduling Algorithms

• Instability of Random

• Now suppose λii = α for all i and λij = 0 for j ≠ i
– clearly, this is admissible traffic for all α < 1
– but, under Random, the service rate at VOQ11 is 1/3 at best
– hence VOQ11 and the switch will be unstable as soon as α ≥ 1/3

• Stability (or 100% throughput) means it is stable under all admissible traffic!
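A rough simulation of a single VOQ under diagonal traffic illustrates the instability. It assumes Bernoulli arrivals with probability α and a Bernoulli service opportunity with probability 1/3 per slot; the function name `simulate_voq`, the seed and the slot count are arbitrary choices for illustration.

```python
import random

def simulate_voq(arrival_prob, service_prob, slots, seed=0):
    """Return the queue length after `slots` time slots."""
    rng = random.Random(seed)
    queue = 0
    for _ in range(slots):
        if rng.random() < arrival_prob:
            queue += 1      # one packet arrives
        if queue > 0 and rng.random() < service_prob:
            queue -= 1      # the random matching happened to serve this VOQ
    return queue

light = simulate_voq(0.2, 1 / 3, 20000)   # alpha below 1/3
heavy = simulate_voq(0.9, 1 / 3, 20000)   # admissible, but above 1/3
```

With α = 0.9 the backlog grows roughly linearly with time, while with α = 0.2 it stays small.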

Page 115: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Obvious Randomized Schemes

• Choose a matching at random and use it as the schedule
– doesn’t give 100% throughput (already shown)

• Choose 2 matchings at random and use the heavier one as the schedule

• Choose N matchings at random and use the heaviest one as the schedule

None of these can give 100% throughput!

Page 116: Packet Scheduling/Arbitration in Virtual Output Queues and Others


[Figure: Mean IQ Len (log scale, 0.001 to 10000) versus Normalized Load (0.0 to 1.0) under Diagonal Traffic, comparing MWM, R32 and R1.]

Page 117: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Iterative Randomized Scheme (Tassiulas)

• Say M is the matching used at time t

• Let R be a new matching chosen uniformly at random (u.a.r.) among the N! different matchings

• At time t+1, use the heavier of M and R

• Complexity is very low: O(1) iterations

• This gives 100% throughput!
– note: the boost in throughput is due to memory (saving previous matchings)

• But delays are very large
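One step of this scheme can be sketched as below, assuming matchings are stored as permutations and the weight of a matching is the sum of the occupancies of the VOQs it serves. The names `matching_weight` and `tassiulas_step` are illustrative, and the queue matrix is held fixed here only to keep the example short; in a switch it changes every slot.

```python
import random

def matching_weight(matching, queues):
    """Sum of the occupancies of the VOQs served by this matching."""
    return sum(queues[i][j] for i, j in enumerate(matching))

def tassiulas_step(previous, queues, rng=random):
    """Keep the heavier of the previous matching and one random matching."""
    candidate = list(range(len(queues)))
    rng.shuffle(candidate)  # uniform random permutation
    if matching_weight(candidate, queues) > matching_weight(previous, queues):
        return candidate
    return previous

queues = [[0, 5, 1], [4, 0, 2], [3, 1, 0]]   # hypothetical occupancies
schedule = [0, 1, 2]
rng = random.Random(1)
for _ in range(50):
    schedule = tassiulas_step(schedule, queues, rng)
```

Because the previous matching is remembered, the weight of the schedule never decreases for a fixed queue state, which is the "memory" the slide refers to.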

Page 118: Packet Scheduling/Arbitration in Virtual Output Queues and Others


[Figure: Mean IQ Len (log scale, 0.01 to 10000) versus Normalized Load (0.1 to 1.0) under Diagonal Traffic, comparing MWM and Tassiulas.]

Page 119: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Finer Observations

• Let M be the schedule used at time t

• Choose a “good” random matching R

• M’ = Merge(M,R)

• M’ includes the best edges from M and R

• Use M’ as the schedule at time t+1

• The above procedure yields an algorithm called LAURA

• There are many other small variations to this algorithm.
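One plausible reading of Merge(M, R), sketched under the assumption that both M and R are full matchings stored as permutations: the union of their edges splits into alternating cycles, and on each cycle the heavier of the two sets of edges is kept. The function name `merge` and the queue values are hypothetical.

```python
def merge(m, r, queues):
    """Combine two full matchings cycle by cycle, keeping the heavier edges."""
    n = len(m)
    r_inv = [0] * n
    for i, j in enumerate(r):
        r_inv[j] = i            # r_inv[j] = input that R connects to output j
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Walk one alternating cycle: M edge out of input i, R edge back.
        cycle, i = [], start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = r_inv[m[i]]
        w_m = sum(queues[k][m[k]] for k in cycle)
        w_r = sum(queues[k][r[k]] for k in cycle)
        source = m if w_m >= w_r else r
        for k in cycle:
            merged[k] = source[k]
    return merged

queues = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 2, 2], [0, 0, 4, 1]]
m = [0, 1, 2, 3]   # weight 8
r = [1, 0, 3, 2]   # weight 9
merged = merge(m, r, queues)
```

Here the merged matching takes inputs 0 and 1 from M and inputs 2 and 3 from R, so it is heavier than either M or R alone.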

Page 120: Packet Scheduling/Arbitration in Virtual Output Queues and Others


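Merging Procedure

[Figure: matchings X (W(X) = 12) and R (W(R) = 10) are merged; the per-cycle weight differences 3-1+2-2 = 2 and 2-1+2-4 = -1 decide which edges are kept, giving the merged matching M with W(M) = 13.]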

Page 121: Packet Scheduling/Arbitration in Virtual Output Queues and Others


[Figure: Mean IQ Len (log scale, 0.01 to 10000) versus Normalized Load (0.1 to 1.0) under Diagonal Traffic, comparing MWM, M-LAURA, LAURA, iLQF and Tassiulas.]

Page 122: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Can we avoid having schedulers altogether?

Page 123: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Recap: Two Successive Scaling Problems

OQ routers:
+ work-conserving (QoS)
- memory bandwidth = (N+1)R

IQ routers:
+ memory bandwidth = 2R
- arbitration complexity (bipartite matching)

[Figure: OQ and IQ router diagrams with line rate R on each port.]

Page 124: Packet Scheduling/Arbitration in Virtual Output Queues and Others


IQ Arbitration Complexity

Today: 64 ports at 10 Gbps, 64-byte cells.

• Arbitration Time = 64 bytes / 10 Gbps = 51.2 ns

• Request/Grant Communication BW = 17.5 Gbps

Scaling to 160 Gbps:

• Arbitration Time = 3.2 ns

• Request/Grant Communication BW = 280 Gbps

Two main alternatives for scaling:
1. Increase cell size
2. Eliminate arbitration
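The arbitration times quoted on this slide follow from the cell time, as the short calculation below shows. The helper name `arbitration_time_ns` is illustrative, and the linear scaling of the request/grant bandwidth with line rate is an assumption that happens to match the slide's figures (17.5 Gbps x 16 = 280 Gbps).

```python
def arbitration_time_ns(cell_bytes, line_rate_gbps):
    """One cell time: bits divided by Gbit/s gives nanoseconds."""
    return cell_bytes * 8 / line_rate_gbps

today_ns = arbitration_time_ns(64, 10)     # 64-byte cells at 10 Gbps
scaled_ns = arbitration_time_ns(64, 160)   # same cells at 160 Gbps

# Assumed: request/grant bandwidth grows in proportion to the line rate.
scaled_bw_gbps = 17.5 * (160 / 10)
```

Doubling the cell size doubles the time available per arbitration decision, which is why increasing the cell size is listed as one scaling alternative.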

Page 125: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Desirable Characteristics for Router Architecture

Ideal: OQ

• 100% throughput

• Minimum delay

• Maintains packet order

Necessary: able to regularly connect any input to any output

What if the world was perfect? Assume Bernoulli iid uniform arrival traffic...

Page 126: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Round-Robin Scheduling

• Uniform & non-bursty traffic => 100% throughput

• Problem: traffic is non-uniform & bursty
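A scheduler-free round-robin crossbar can be sketched as a fixed cyclic pattern. The rule used here, input i connected to output (i + t) mod N in slot t, is one common cyclic-shift form and is an assumption; the slides do not spell out the exact pattern. The function names are illustrative.

```python
def round_robin_output(i, t, n):
    """Output connected to input i in time slot t (assumed cyclic shift)."""
    return (i + t) % n

def round_robin_schedule(t, n):
    """Full crossbar configuration for slot t; no arbitration is needed."""
    return [round_robin_output(i, t, n) for i in range(n)]

n = 4
frame = [round_robin_schedule(t, n) for t in range(n)]
# Outputs visited by input 0 over one frame of n slots.
visited_by_input_0 = sorted(cfg[0] for cfg in frame)
```

Every input reaches every output exactly once per frame of N slots, so each VOQ receives a service rate of 1/N, which is enough only when the traffic is uniform.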

Page 127: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Two-Stage Switch (I)

[Figure: two-stage switch; external inputs 1..N connect through a first round-robin stage to internal inputs 1..N, which connect through a second round-robin stage to external outputs 1..N.]

Page 128: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Two-Stage Switch (I)

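[Figure: two-stage switch; the first round-robin stage, between external inputs 1..N and internal inputs 1..N, performs load balancing, and the second round-robin stage connects internal inputs to external outputs 1..N.]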

Page 129: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Two-Stage Switch Characteristics

• 100% throughput

• Problem: unbounded mis-sequencing

[Figure: external inputs, internal inputs and external outputs (1..N each) joined by two cyclic-shift stages; packets labelled 1 and 2 illustrate mis-sequencing.]

Page 130: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Two-Stage Switch (II)

[Figure: two-stage switch with a flow splitter and load balancer at the external inputs (i), VOQs ahead of the first-stage round-robin to the internal outputs (j), and VOQs at the internal inputs (j) ahead of the second-stage round-robin to the external outputs (k); each set of ports is numbered 1..N and flows are labelled F_ik.]

New: N^3 instead of N^2

Page 131: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Expanding VOQ Structure

Solution: expand VOQ structure by distinguishing among switch inputs

[Figure: expanded VOQ structure; elements labelled 1, 2, 3 and a, b.]

Page 132: Packet Scheduling/Arbitration in Virtual Output Queues and Others


What is being done in practice (Cisco for example)

• They want schedulers that achieve 100% throughput and very low delay (like MWM)

• They want it to be as simple as iSLIP in terms of hardware implementation

• Is there any solution to this?

Page 133: Packet Scheduling/Arbitration in Virtual Output Queues and Others


Typical Performance of ISLIP-like Algorithms

PIM with 4 iterations

Page 134: Packet Scheduling/Arbitration in Virtual Output Queues and Others


What is being done in practice (Cisco for example)

Company    Switching Capacity       Switch Architecture    Fabric Overspeed
Agere      40 Gbit/s-2.5 Tbit/s     Arbitrated crossbar    2x
AMCC       20-160 Gbit/s            Shared memory          1.0x
AMCC       40 Gbit/s-1.2 Tbit/s     Arbitrated crossbar    1-2x
Broadcom   40-640 Gbit/s            Buffered crossbar      1-4x
Cisco      40-320 Gbit/s            Arbitrated crossbar    2x