lect4_switchcore
-
8/8/2019 lect4_switchcore
1/39
Switch core architecture
Switching fabric
Queuing variations
Switch scheduling algorithms
-
Switch core in a router
Arbiter
Optional input queue
Optional output queue
Switch fabric that provides parallel data paths
-
Switch fabric topologies
Crossbar
Simple space division switch
Each crosspoint can be turned on or off
(Figure: crossbar with configuration, data in, and data out)
-
Crossbar
Allows any permutation communication to be non-blocking
Permutation communication: each input port connects to a distinct output port.
Each node can send/receive at most once
Example:
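A permutation can be written as a list in which entry i is the output port that input port i connects to. The sketch below turns such a list into a crosspoint on/off matrix; the function name `crossbar_config` and the list representation are assumptions made for illustration, not part of the lecture.

```python
def crossbar_config(perm):
    """Return an N x N 0/1 crosspoint matrix for a permutation.

    perm[i] is the output port that input port i connects to
    (hypothetical representation chosen for this sketch).
    """
    n = len(perm)
    # A permutation uses every output exactly once.
    if sorted(perm) != list(range(n)):
        raise ValueError("not a permutation: outputs must be distinct")
    # Row i, column j is on only when input i is connected to output j.
    return [[1 if perm[i] == j else 0 for j in range(n)] for i in range(n)]


config = crossbar_config([2, 0, 3, 1])
# Each row (input) and each column (output) has exactly one active crosspoint.
assert all(sum(row) == 1 for row in config)
assert all(sum(col) == 1 for col in zip(*config))
```

Because every row and column has exactly one active crosspoint, no two connections share a crosspoint, which is why a crossbar never blocks a permutation.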
-
Crossbar
Advantages:
Simple to implement
Simple control
Flexible
Drawback:
Number of crosspoints grows as N^2, not scalable
Good for small N
-
Multi-stage switch
In a crossbar, in each switching phase, at most one crosspoint in each row or column is active.
One connection goes through one crosspoint
The objective is to achieve nonblocking communication for permutations.
Multi-stage network basic idea: compress the crosspoints
Each connection goes through multiple crosspoints
Reduce the number of crosspoints while still having nonblocking communication
-
Clos networks
The Clos network (Clos, 1953) is the father of all multi-stage networks
Basic form: 3-stage Clos networks
Three stages: input, middle, and output
k input switches, m middle switches, k output switches
Input stage: n x m switches
Middle stage: k x k switches
Output stage: m x n switches
Each input switch connects to each of the middle switches; each middle switch connects to each of the output switches
-
Clos networks
Total number of switches?
Total number of input/output ports?
(Figure: three-stage Clos network with switches numbered 0 to k-1 in the input and output stages. Input stage: n x m switches; middle stage: k x k switches; output stage: m x n switches)
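The two questions on this slide follow directly from the stage sizes: there are k input, m middle, and k output switches, and each input switch contributes n external ports. A small sketch of that arithmetic is given here; the function name `clos_size` and the returned dictionary are illustrative choices, not from the lecture.

```python
def clos_size(n, m, k):
    """Count switches, external ports, and crosspoints of a 3-stage Clos network.

    n: external ports per input (and output) switch
    m: number of middle switches
    k: number of input (and output) switches
    """
    switches = k + m + k                # input + middle + output stage
    ports = n * k                       # external inputs (same number of outputs)
    crosspoints = (k * n * m            # k input switches of size n x m
                   + m * k * k          # m middle switches of size k x k
                   + k * m * n)         # k output switches of size m x n
    return {"switches": switches, "ports": ports, "crosspoints": crosspoints}


print(clos_size(2, 2, 3))
```

For n = 2, m = 2, k = 3 this gives 8 switches and 6 ports on each side, which matches a 6x6 switch built from six 2x2 switches and two 3x3 switches.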
-
An example Clos network
Building a larger 6x6 nonblocking switch from six 2x2 switches and two 3x3 switches
-
Another example Clos network
Is this network nonblocking? Why?
The total width of the middle stage is 6, while the total input/output width is 9.
When is a Clos network nonblocking then?
-
About non-blocking networks
Strict-sense nonblocking: can always find a route from a free input to a free output without changing existing routes, no matter how the existing routes were chosen.
Wide-sense nonblocking: can find a route from a free input to a free output without changing existing routes, provided routes for new connections are chosen suitably.
Rearrangeably nonblocking: can route any permutation without contention (may require old connections to be rerouted).
-
Nonblocking conditions for Clos networks
Clos Theorem (1953): A Clos network is strict-sense nonblocking if and only if m >= 2n-1.
Proof?
-
Nonblocking conditions for Clos networks
Benes Theorem (1962): A Clos network is rearrangeably nonblocking iff m >= n.
Necessary condition is straightforward
Sufficient condition:
Build a bipartite graph
Nodes: input and output switches
Edges: connections (from input to output switches)
Maximum degree is n, so the edges can be colored with n colors; each color corresponds to one middle switch
-
Strict-sense nonblocking and rearrangeably nonblocking example
Clos(2, 2, 3) is rearrangeably nonblocking but not strict-sense nonblocking
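The two conditions stated in the theorems (m >= 2n-1 for strict-sense, m >= n for rearrangeable) can be turned into a quick check. The function below is only a sketch of that comparison; its name and the returned strings are made up for illustration.

```python
def clos_nonblocking_class(n, m):
    """Classify a 3-stage Clos network from n (ports per input switch)
    and m (number of middle switches), using the two theorems."""
    if m >= 2 * n - 1:
        return "strict-sense nonblocking"
    if m >= n:
        return "rearrangeably nonblocking"
    return "blocking"


print(clos_nonblocking_class(2, 2))   # n = 2, m = 2
print(clos_nonblocking_class(2, 3))   # one more middle switch
print(clos_nonblocking_class(3, 2))   # fewer middle switches than ports per switch
```

With n = 2 and m = 2 the network satisfies m >= n but not m >= 2n-1 = 3, so it falls in the rearrangeable class. Note that k does not appear in either condition.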
-
Recursive construction of large switches
Each switch in the 3-stage Clos network can be recursively constructed from smaller switches.
Construct an N x N nonblocking switch with 2x2 switches (Benes networks):
Input/output stage: N/2 2x2 switches
Middle stage: 2 N/2 x N/2 switches (recursively built)
How many 2x2 switches needed?
How many crosspoints needed?
T(N) = ?
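One way to approach T(N) is to write the construction as a recurrence: the input and output stages together use N 2x2 switches, and the middle stage is two N/2 x N/2 networks, so T(N) = N + 2 T(N/2) with T(2) = 1. The sketch below evaluates that recurrence and compares it with the closed form N log2(N) - N/2; it assumes N is a power of two and counts 4 crosspoints per 2x2 switch. The function names are illustrative.

```python
def benes_switch_count(n_ports):
    """Number of 2x2 switches in an N x N Benes network, by the recurrence
    T(N) = N + 2 * T(N / 2), T(2) = 1. Assumes N is a power of two."""
    if n_ports < 2 or n_ports & (n_ports - 1):
        raise ValueError("N must be a power of two and at least 2")
    if n_ports == 2:
        return 1
    # N/2 input switches + N/2 output switches + two half-size networks.
    return n_ports + 2 * benes_switch_count(n_ports // 2)


def benes_switch_count_closed_form(n_ports):
    """Closed form of the same recurrence: N * log2(N) - N / 2."""
    log2_n = n_ports.bit_length() - 1   # exact log2 for powers of two
    return n_ports * log2_n - n_ports // 2


def benes_crosspoints(n_ports):
    """Each 2x2 switch is a tiny crossbar with 4 crosspoints."""
    return 4 * benes_switch_count(n_ports)


for size in (2, 4, 8, 16):
    assert benes_switch_count(size) == benes_switch_count_closed_form(size)
print(benes_switch_count(16), benes_crosspoints(16), 16 * 16)
```

For N = 16 the recurrence gives 56 switches (7 stages of 8 switches) and 224 crosspoints, compared with 256 for a 16 x 16 crossbar; the gap widens quickly as N grows.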
-
16 x 16 Benes network
-
Multistage networks
Many variations of multi-stage networks
Clos, Benes, Banyan, Cantor, etc.
With different objectives
The idea is somewhat similar.
All of them try to provide crossbar functionality: all try to achieve nonblocking communication for permutations.
Both crossbar and multistage networks are used as switching fabric.
-
What is there besides the switching fabric?
What problem does the switching fabric solve?
Any permutation can be done in one switching cycle.
1-to-1 demand
The traffic in a router is more complex than a permutation.
Many-to-1 demand?
Not all packets can get through the fabric right away.
Still need buffering and scheduling!!
-
Switch Model
N x N switch
Fixed-size packets (cells) are much easier for the switch to manage. Most practical switches use fixed-size cells.
All line rates are the same: line cards aggregate lines with different rates.
Switching cycle: the time between cell arrivals (determined by the line rate)
-
Input/output queueing variations
Output queueing: buffering at the switch output.
Maximum throughput
Memory must be N times faster than line speed.
Memory speed is already a bottleneck!!
Not an option for top-of-the-line routers
-
Input/output queueing variations
Input queueing: buffering at the switch input.
-
Input/output queueing variations
Head of line blocking with input queueing
Throughput can be significantly affected.
-
Input/output queueing variations
Impact of head of line blocking: maximum throughput = 2-sqrt(2) = 58.6% (uniform random traffic, large N)
(Figure: delay versus load, with load marks at 58.6% and 100%)
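The 58.6% figure can be checked roughly with a small simulation. The sketch below models a saturated switch with one FIFO per input: every input always has a head-of-line cell with a uniformly random destination, and each output serves one contending input per cycle, chosen at random. The function name, the random tie-breaking, and the seed are assumptions of this sketch, and a finite N gives a value slightly above the large-N limit.

```python
import random


def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Estimate throughput per port of a FIFO input-queued switch under saturation."""
    rng = random.Random(seed)
    # Destination of the head-of-line (HOL) cell at every input.
    hol_dest = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        # Group inputs by the output their HOL cell wants.
        contenders = {}
        for inp, out in enumerate(hol_dest):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves exactly one contender; the losers stay
        # blocked, together with every cell queued behind them.
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)
            hol_dest[winner] = rng.randrange(n_ports)   # next cell moves to HOL
            delivered += 1
    return delivered / (n_ports * n_slots)


print(hol_saturation_throughput(32, 2000))
print(2 - 2 ** 0.5)   # large-N limit quoted on the slide
```

With 32 ports the estimate should land close to 0.6, noticeably below 1.0 even though every input always has something to send.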
-
Input/output queueing variations
Virtual output queueing (input queue):
Use a separate queue for each output port in each line card.
Removes the head of line blocking
Arbiter becomes more complex
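As a data structure, virtual output queueing is just one FIFO per output at each input. The class below is a minimal sketch under that reading; the name `VOQInput` and its methods are hypothetical and only meant to show why a cell for a busy output no longer blocks cells behind it.

```python
from collections import deque


class VOQInput:
    """One input line card holding a separate FIFO for every output port."""

    def __init__(self, n_outputs):
        self.queues = [deque() for _ in range(n_outputs)]

    def enqueue(self, cell, out):
        """Store an arriving cell in the queue of its destination output."""
        self.queues[out].append(cell)

    def requests(self):
        """Outputs this input would request from the arbiter this cycle."""
        return {out for out, q in enumerate(self.queues) if q}

    def dequeue(self, out):
        """Send the oldest cell for the output chosen by the arbiter."""
        return self.queues[out].popleft()


port = VOQInput(3)
port.enqueue("cell-a", 0)   # arrives first, destined to output 0
port.enqueue("cell-b", 1)   # arrives second, destined to output 1
# Even if output 0 is busy, the arbiter can still serve output 1.
assert port.dequeue(1) == "cell-b"
```

The price is visible in `requests`: the arbiter now sees up to N requests per input instead of one, which is where the extra scheduling complexity comes from.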
-
Input/output queueing variations
Combined Input Output Queueing (CIOQ)
Queues at both input and output
Memory speedup 1
-
Input queue scheduling, the bipartite matching problem
The scheduling algorithm should try to maximize the number of connections in order to maximize the throughput
-
Maximum and maximal matching
Maximum matching: find the largest number of connections
How to do it?
O(N^3) complexity, can cause starvation
Maximal matching: cannot add any connection to the matching without creating a conflict
More practical
(Figure: the problem, a maximum matching, and a maximal matching)
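The difference between the two notions shows up already with two inputs. The sketch below compares a greedy maximal matching with a maximum matching found by augmenting paths (Kuhn's algorithm, one standard way to do it, not necessarily the method the lecture has in mind). Requests are given as a list where entry i lists the outputs input i has cells for; this representation and the function names are assumptions.

```python
def greedy_maximal_matching(requests):
    """Take inputs in order and give each the first free output it requests.
    The result is maximal: no further connection can be added."""
    used_outputs = set()
    match = {}
    for inp, outs in enumerate(requests):
        for out in outs:
            if out not in used_outputs:
                match[inp] = out
                used_outputs.add(out)
                break
    return match


def maximum_matching(requests):
    """Augmenting-path search: may reroute earlier inputs to fit a new one."""
    owner = {}   # output -> input currently matched to it

    def try_assign(inp, visited):
        for out in requests[inp]:
            if out in visited:
                continue
            visited.add(out)
            # Use the output if it is free, or if its owner can move elsewhere.
            if out not in owner or try_assign(owner[out], visited):
                owner[out] = inp
                return True
        return False

    for inp in range(len(requests)):
        try_assign(inp, set())
    return {inp: out for out, inp in owner.items()}


requests = [[0, 1], [0]]   # input 0 wants outputs 0 and 1, input 1 wants output 0
print(greedy_maximal_matching(requests))   # one connection
print(maximum_matching(requests))          # two connections
```

The greedy pass connects input 0 to output 0 and then has nothing left for input 1; the augmenting-path search moves input 0 to output 1 so that both inputs are served. Each search costs time proportional to the number of requests, which is where the cubic bound for dense request patterns comes from.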
-
Practical matching algorithms
PIM parallel iterative matching
RRM Round-Robin matching
iSLIP iterative serial-line IP
-
PIM
Repeat until no new matching is found
1. Request: each unmatched input sends a request to every output for which it has a queued cell
2. Grant: if an unmatched output receives any requests, it grants one at random.
3. Accept: if an input receives grants, it accepts one at random.
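The three steps can be sketched in software as follows. This is only an illustration: a real arbiter runs the outputs and inputs in parallel in hardware, while this sketch loops over them, and the request matrix layout, the function name `pim`, and the use of Python's `random.Random` as the random source are all assumptions.

```python
import random


def pim(requests, rng, max_iterations=None):
    """Parallel iterative matching on a boolean request matrix.

    requests[i][j] is True if input i has a queued cell for output j.
    Returns (match, iterations) where match maps input -> output.
    """
    n = len(requests)
    match = {}              # input -> output
    matched_outputs = set()
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        # Request + grant: every unmatched output picks one of the
        # unmatched inputs requesting it, uniformly at random.
        grants = {}         # input -> outputs that granted to it
        for out in range(n):
            if out in matched_outputs:
                continue
            requesters = [i for i in range(n)
                          if i not in match and requests[i][out]]
            if requesters:
                grants.setdefault(rng.choice(requesters), []).append(out)
        if not grants:
            break           # no new matching can be found
        # Accept: every input that received grants picks one at random.
        for inp, outs in grants.items():
            out = rng.choice(outs)
            match[inp] = out
            matched_outputs.add(out)
        iterations += 1
    return match, iterations


full_load = [[True] * 4 for _ in range(4)]
match, used = pim(full_load, random.Random(1))
print(match, used)
```

Every iteration that produces a grant adds at least one connection, so the loop ends after at most N iterations, and the result is a maximal matching. Passing `max_iterations=1` gives the single-iteration behaviour.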
-
PIM example
(Figure: request, grant, and accept phases)
-
PIM example
The next iteration
-
PIM property
Converges in O(log N) iterations on average (what is the worst case?)
Does not perform well for a single iteration: about 63% (1-1/e) of the throughput
Computed from the probability that an input remains ungranted.
Hardware random number generator?
We would like to have an algorithm that performs well in one iteration!
This function is in the critical data path.
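The 63% figure can be reproduced with a line of arithmetic, assuming every input has cells for every output and each of the N outputs grants independently and uniformly at random. An input is then left without any grant with probability (1 - 1/N)^N, which tends to 1/e. The function name below is illustrative.

```python
import math


def pim_single_iteration_fraction(n_ports):
    """Fraction of inputs matched after one PIM iteration under full load:
    1 minus the probability that none of the N outputs grants this input."""
    return 1.0 - (1.0 - 1.0 / n_ports) ** n_ports


for size in (2, 4, 16, 1024):
    print(size, round(pim_single_iteration_fraction(size), 4))
print("limit", round(1 - 1 / math.e, 4))
```

Small switches do better than the limit (a 2 x 2 switch gets 75%), but the fraction drops toward 1 - 1/e as N grows.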
-
RRM Round robin matching
Request: same as PIM
Grant: if an output receives requests, it chooses the one that appears next in a fixed round-robin schedule, starting from the highest priority element
Increment the round-robin pointer to one location beyond the granted input
Accept: if an input receives grants, it accepts the one that appears next in a fixed round-robin schedule, starting from the highest priority element
Increment the round-robin pointer to one location beyond the accepted output
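A single RRM iteration can be sketched as below. The pointer lists `grant_ptr` and `accept_ptr`, the request matrix layout, and the function name are assumptions of this sketch; hardware would evaluate all arbiters in parallel rather than in loops.

```python
def rrm_step(requests, grant_ptr, accept_ptr):
    """One round-robin matching iteration; updates both pointer lists in place.

    requests[i][j] is truthy if input i has a cell for output j.
    grant_ptr[j]  : highest-priority input at output j's grant arbiter.
    accept_ptr[i] : highest-priority output at input i's accept arbiter.
    Returns a dict mapping input -> output.
    """
    n = len(requests)
    grants = {}   # input -> outputs that granted to it
    for out in range(n):
        for step in range(n):
            inp = (grant_ptr[out] + step) % n
            if requests[inp][out]:
                grants.setdefault(inp, []).append(out)
                # RRM moves the grant pointer whether or not the grant is accepted.
                grant_ptr[out] = (inp + 1) % n
                break
    match = {}
    for inp, outs in grants.items():
        for step in range(n):
            out = (accept_ptr[inp] + step) % n
            if out in outs:
                match[inp] = out
                accept_ptr[inp] = (out + 1) % n
                break
    return match


# Fully loaded 2 x 2 switch, all pointers starting at 0.
grant_ptr, accept_ptr = [0, 0], [0, 0]
full_load = [[1, 1], [1, 1]]
sizes = [len(rrm_step(full_load, grant_ptr, accept_ptr)) for _ in range(6)]
print(sizes)
```

In this fully loaded 2 x 2 case both grant pointers move in lockstep, so both outputs keep granting the same input and only one connection is made per cycle.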
-
RRM example
-
RRM
RRM has lower complexity than PIM
RR arbiters are simpler than random arbiters
Deterministic method
Can perform poorly for certain patterns
The output arbiters are somewhat synchronized
Can have starvation
-
iSLIP
A variation of RRM: the grant arbiters do not move unless the grant is accepted.
The algorithm is the same as RRM except that in the grant step, the RR pointer is incremented to one location beyond the granted input if and only if the grant is accepted in step 3.
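The change relative to RRM is a single line: the grant pointer is updated in the accept step instead of the grant step. The sketch below shows one iteration per cycle under the same assumed representation (request matrix, `grant_ptr`, `accept_ptr`, all names illustrative); the multi-iteration variant is not covered here.

```python
def islip_step(requests, grant_ptr, accept_ptr):
    """One single-iteration iSLIP cycle; updates both pointer lists in place.

    requests[i][j] is truthy if input i has a cell for output j.
    Returns a dict mapping input -> output.
    """
    n = len(requests)
    grants = {}   # input -> outputs that granted to it
    for out in range(n):
        for step in range(n):
            inp = (grant_ptr[out] + step) % n
            if requests[inp][out]:
                grants.setdefault(inp, []).append(out)
                break   # grant pointer is NOT moved here
    match = {}
    for inp, outs in grants.items():
        for step in range(n):
            out = (accept_ptr[inp] + step) % n
            if out in outs:
                match[inp] = out
                accept_ptr[inp] = (out + 1) % n
                # Move the grant pointer only because the grant was accepted.
                grant_ptr[out] = (inp + 1) % n
                break
    return match


# Fully loaded 2 x 2 switch, all pointers starting at 0.
grant_ptr, accept_ptr = [0, 0], [0, 0]
full_load = [[1, 1], [1, 1]]
sizes = [len(islip_step(full_load, grant_ptr, accept_ptr)) for _ in range(6)]
print(sizes)
```

In this 2 x 2 example the first cycle still makes only one connection, but the output whose grant was rejected keeps its pointer, so the two grant pointers drift apart and every later cycle matches both inputs.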
-
iSLIP example
-
iSLIP properties
Property 1: Lowest priority is given to the most recently made connection.
Property 2: No starvation; a request is served within at most N^2 scheduling cycles.
Property 3: Under heavy load, all queues with a common output have the same throughput.
-
iSLIP properties
Simple to implement
Starvation free
Throughput is about 100%
Fair
As load increases, the matches get larger