HPSR 2006: Distributed Crossbar Schedulers. Cyriel Minkenberg (1), Francois Abel (1), Enrico Schiattarella (2). (1) IBM Research, Zurich Research Laboratory; (2) Dipartimento di Elettronica, Politecnico di Torino.


Page 1:

HPSR 2006

Distributed Crossbar Schedulers

Cyriel Minkenberg (1), Francois Abel (1), Enrico Schiattarella (2)
(1) IBM Research, Zurich Research Laboratory
(2) Dipartimento di Elettronica, Politecnico di Torino

Page 2:

Outline

OSMOSIS overview

Challenges in the OSMOSIS scheduler design

Basics of crossbar scheduling

Distributed scheduler architecture

Problems

Solutions

Results

Implementation

Page 3:

OSMOSIS Overview

[Figure: OSMOSIS demonstrator overview. 64 ingress adapters (VOQs, Tx, control) connect through the all-optical switch to 64 egress adapters (EQ, 2 Rx, control). The optical broadcast-and-select path comprises 8 broadcast units (WDM mux, optical amplifier, star coupler) and 128 select units (fast SOA 1x8 fiber selector gates and 1x8 wavelength selector gates feeding an 8x1 combiner). Control links connect the adapters to the central scheduler, which runs a bipartite graph matching (BGM) algorithm. Control flow: (1) packet waiting, (2) request, (3) matching computation in the central scheduler, (4a) grant and (4b) SOA switch command, (5) all-optical packet transfer.]

64 ports @ 40 Gb/s, 256-byte cells => 51.2 ns time slot

Broadcast-and-select architecture (crossbar)

Combination of wavelength- and space-division multiplexing

Fast switching based on SOAs

Electronic input and output adapters, electronic arbitration

Page 4:

Architectural Scheduler Challenges

Latency < 1 µs. Problem: long permission latency (RTT + scheduling). Solution: speculation

Multicast support. Problem: fair integration with unicast scheduling, control channel overhead. Solution: independent schedulers with a filter, merge & feedback scheme

Scheduling rate = cell rate. Problem: produce one high-quality matching every 51.2 ns. Solution: deeply pipelined matching with parallel sub-schedulers (FLPPR)

FPGA-only scheduler implementation. Problem: does a 64-port scheduler fit in one FPGA device? If not, how do we distribute it over multiple devices while maintaining an acceptable level of performance?

Page 5:

Crossbar Scheduling: Bipartite Graph Matching

A crossbar is a non-blocking fabric that can transfer cells from any input to any output with the following constraints:

At most one cell from any input

At most one cell to any output

Computing a conflict-free set of input-output connections is therefore equivalent to Bipartite Graph Matching (BGM)

[Figure: a request matrix and three example bipartite matchings between inputs and outputs: a maximal matching of size 2, a maximal matching of size 3, and a maximum matching of size 4.]
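To make the constraints concrete, here is a minimal Python sketch (illustrative only, not part of the OSMOSIS design; the names are assumptions) that checks whether a set of requested input-output pairs satisfies them, i.e. forms a matching:

```python
# Sketch: the two crossbar constraints expressed as a matching check.
def is_valid_matching(edges):
    """edges: list of (input, output) pairs.
    Valid iff no input and no output appears more than once."""
    inputs = [i for i, _ in edges]
    outputs = [o for _, o in edges]
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)

# A maximal matching cannot be extended with another edge; a maximum matching
# additionally has the largest possible size (cf. the sizes 2, 3, 4 above).
print(is_valid_matching([(0, 1), (1, 2), (2, 0)]))  # True
print(is_valid_matching([(0, 1), (1, 1)]))          # False: output 1 used twice
```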

Page 6:

Pointer-based Parallel Iterative Matching

One matching must be computed in every time slot, so we need fast and simple algorithms

A suitable class of algorithms is parallel, iterative, and based on round-robin pointers: i-SLIP (McKeown), DRRM (Chao)

These algorithms have a number of desirable features:

100% throughput under uniform i.i.d. traffic

Starvation-free: any VOQ is served within finite time under any traffic pattern

Iterative: sequential improvement of the matching by repeating steps

Amenable to fast hardware implementation; high degree of parallelism and symmetry

Page 7:

DRRM Operation

[Figure: four input selectors IS[1]..IS[4] and four output selectors OS[1]..OS[4], with the VOQ state driving the input selectors.]

Step 0: Initially, all inputs and outputs are unmatched

Step 1: Each unmatched input requests the first unmatched output in round-robin order for which it has a packet, starting from pointer R[i]. R[i] ← (R[i] + 1) modulo N iff the request is granted in Step 2 of the first iteration

Step 2: Each output grants the first input in round-robin order that has requested it, starting from pointer G[o]. G[o] ← (G[o] + 1) modulo N

Iterate: Repeat Steps 1 and 2 until no more edges can be added or a fixed number of iterations is completed

Key to good performance is pointer desynchronization

If all VOQs are non-empty, pointers eventually all point to different outputs

No conflicts: maximum performance
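As an illustration of these steps, here is a minimal single-iteration DRRM sketch in Python. It takes the monolithic view in which an input learns the outcome of its request immediately; the names and data structures are assumptions for the sketch, not the OSMOSIS hardware:

```python
# Minimal single-iteration DRRM sketch (illustrative; not the OSMOSIS hardware).
N = 4
R = [0] * N  # per-input request pointers
G = [0] * N  # per-output grant pointers

def drrm_iteration(voq):
    """One request/grant phase. voq[i][j] is the number of cells queued at
    input i for output j. Returns the list of matched (input, output) edges."""
    # Step 1: each input requests the first non-empty VOQ at or after R[i].
    requests = {}  # output -> list of requesting inputs
    for i in range(N):
        for k in range(N):
            j = (R[i] + k) % N
            if voq[i][j] > 0:
                requests.setdefault(j, []).append(i)
                break
    # Step 2: each requested output grants the first requester at or after G[o].
    matching = []
    for o, reqs in requests.items():
        for k in range(N):
            i = (G[o] + k) % N
            if i in reqs:
                matching.append((i, o))
                G[o] = (i + 1) % N  # grant pointer moves past the granted input
                R[i] = (o + 1) % N  # request pointer moves past the granted output
                break
    return matching

# With all VOQs non-empty, repeated calls let the pointers desynchronize until
# every time slot yields a full, conflict-free matching.
print(drrm_iteration([[1] * N for _ in range(N)]))
```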

Page 8:

Distribution Issues

Problem: Scheduler does not fit in a single device due to area constraints

Quadratic complexity growth of priority encoders

Monolithic implementation (implicit temporal and spatial assumptions):

All results are available before the next time slot (or iteration)

All required information is available to all selectors

Distributed implementation breaks these assumptions. Main problem: an input selector issues a request at t0 and receives the result (granted or not) at t0 + RTT

The input selector does not know the results of requests issued during the last RTT

Selectors are only aware of local status info (e.g. matches made in previous iterations)

The time required for information to travel from the inputs to the outputs and back is called the round-trip time (RTT)

Θ = RTT / (cell duration), i.e. the RTT expressed in time slots

[Figure: timing diagram between input selectors IS[1]..IS[N] and output selectors OS[1]..OS[N] with RTT >> cell duration: a request travels to the output selector, output selection and status update take place, the grant travels back, and the input status update and selection occur one RTT after the request was issued.]
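As a purely illustrative calculation (the actual OSMOSIS control-path delay is not given on this slide): with the 51.2 ns time slot, a round-trip time of about 205 ns corresponds to Θ = 4 and one of about 512 ns to Θ = 10, two of the settings simulated in the performance results below.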

Page 9:

Coping with Uncertainty (1)

Problem: Uncertainty in the algorithm's status; the pointer-update mechanism breaks

– No desynchronization => throughput loss

Solution: Maintain a separate pointer set for each time slot in the RTT. Basic idea: no pointer is reused before the last result is available

– Each input (output) selector maintains Θ distinct request (grant) pointers, labeled R[t] and G[t], with t ∈ [0, Θ−1]

– At time slot t the input selectors use set R[t mod Θ] to generate requests; each request carries the ID of the pointer set used

– Output selectors generate grants using G[t] in response to requests from R[t]

Each pointer set is updated independently from the others, so they all desynchronize independently. Therefore, all the good features of DRRM are preserved

Pointer sets are only updated once every RTT, hence they take longer to desynchronize
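A minimal sketch of this bookkeeping on the input side, with Θ pointer sets indexed by t mod Θ (the grant pointers G[t] are handled analogously at the outputs; all names here are assumptions, not the OSMOSIS implementation):

```python
# Sketch of the multi-pointer bookkeeping (assumed names; Theta = RTT in slots).
Theta = 4
N = 4
R = [[0] * N for _ in range(Theta)]  # R[s][i]: request pointer of input i in set s

def issue_request(t, i, voq_nonempty):
    """At time slot t, input i selects an output using pointer set s = t mod Theta
    and tags the request with s so the returning grant updates the right set."""
    s = t % Theta
    for k in range(N):
        j = (R[s][i] + k) % N
        if voq_nonempty(i, j):
            return {"input": i, "output": j, "pointer_set": s}
    return None

def on_grant(result):
    """Called one RTT after the request was issued; only the tagged pointer set
    advances, so the Theta sets desynchronize independently of each other."""
    s, i, j = result["pointer_set"], result["input"], result["output"]
    R[s][i] = (j + 1) % N
```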

Page 10:

Coping with Uncertainty (2)

Problem: Uncertainty in the algorithm's status; the VOQ-state update mechanism breaks

– How many requests were successful?

– Excess requests may lead to "wasted" grants, leading to reduced performance

Solution: Maintain a pending request counter for every VOQ. P(i,j) tracks the number of requests issued for VOQ(i,j) over the last RTT

– Increment when issuing a new request

– Decrement when the result arrives

Filter requests: if P(i,j) exceeds the number of unserved cells in VOQ(i,j), do not submit further requests

This massively reduces the number of wasted grants
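A minimal sketch of the pending-request-counter filter, with assumed names (P, voq_len) rather than the actual OSMOSIS structures:

```python
# Pending-request-counter (PRC) sketch (illustrative; names are assumptions).
from collections import defaultdict

P = defaultdict(int)        # P[(i, j)]: requests for VOQ(i, j) still in flight
voq_len = defaultdict(int)  # unserved cells currently queued in VOQ(i, j)

def may_request(i, j):
    """Filter: stop requesting once every unserved cell has a request in flight."""
    return P[(i, j)] < voq_len[(i, j)]

def on_request_issued(i, j):
    P[(i, j)] += 1            # increment when issuing a new request

def on_result(i, j, granted):
    P[(i, j)] -= 1            # decrement when the result arrives, one RTT later
    if granted:
        voq_len[(i, j)] -= 1  # the granted cell will be transmitted
```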

Page 11:

Multi-pointer Approach (RTT = 4)

[Figure: four input selectors IS[1]..IS[4] and four output selectors OS[1]..OS[4]. Each input selector holds a request-pointer set R[t0;i]..R[t3;i] and each output selector a grant-pointer set G[t0;o]..G[t3;o], one pointer per time slot of the RTT window, alongside the VOQ state and the pending request counters.]

Hardware cost:

(Θ − 1) additional pointers at each input/output, each log2(N) bits wide

N^2 pending request counters

N Θ-to-1 multiplexers

Selection logic is not duplicated

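As a worked example, assuming Θ = 4 (the value in the figure above) and the 64-port OSMOSIS configuration: 3 additional pointers per selector, each log2(64) = 6 bits wide; 64^2 = 4096 pending request counters; and 64 pointer multiplexers, each 4-to-1.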

Page 12:

Multiple Iterations

Additional uncertainty: Which inputs/outputs have been matched in previous iterations?

1. Inputs should not request outputs that are already taken: Wasted requests

2. Outputs should not grant inputs that are already taken: Violation of one-to-one matching property

Because of issue 2 above, the output selectors must be aware of all grants made in previous iterations, including those made by other selectors => implement all output selectors in one device

Input selectors use a request flywheel pointer to create request diversity across multiple iterations

PRC filtering applies only to the first iteration, which can lead to “premature” grants
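The flywheel pointer is only named on this slide, so the following Python sketch is speculative: one plausible way for an unmatched input to spread its later-iteration requests over different outputs. All names and the exact update rule are assumptions, not the documented OSMOSIS logic:

```python
# Speculative sketch of a per-input "flywheel" pointer F[i] that advances every
# iteration, spreading later-iteration requests over different outputs.
N = 4
F = [0] * N  # assumed flywheel pointers, one per input

def later_iteration_request(i, iteration, wants, output_free_locally):
    """Pick a request for unmatched input i in iteration >= 2.
    wants(i, j): input i has a cell for output j.
    output_free_locally(j): j not matched in an earlier local iteration."""
    start = (F[i] + iteration) % N   # diversity across iterations
    for k in range(N):
        j = (start + k) % N
        if wants(i, j) and output_free_locally(j):
            F[i] = (j + 1) % N       # advance the flywheel past the chosen output
            return j
    return None
```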

Page 13:

Distributed Scheduler Architecture

[Figure: distributed scheduler partitioning. The input selectors IS[1]..IS[4], together with the VOQ state, sit on the control channel interfaces (each on a separate card) and communicate over control channels with the output selectors OS[1]..OS[4] in the allocators on the midplane, which drive the switch command channels.]

Page 14:

Performance Characteristics (16 ports)

[Figure: four simulated latency-throughput plots for a 16-port scheduler, showing mean latency in time slots (log scale, 1 to 1000) versus throughput (0 to 1), with curves for 1, 2, 3, 4, 8, and 16 iterations and for the monolithic scheduler. Panels: uniform Bernoulli traffic with RTT = 4, RTT = 10, and RTT = 20, and uniform Bernoulli traffic with RTT = 4 without PRCs.]

Page 15:

Optical Switch Controller Module (OSCM)

Midplane (OSCB; prototype shown here) with 40 daughter boards (OSCI; top right). Board layout (bottom right)

Page 16:

Thank You!