jon turner [email protected] jst resilient cell resequencing in terabit routers
TRANSCRIPT
2 - Jonathan Turner
Why Resequencing?
Multistage interconnection networks with buffered switch elements and dynamic routing.»scalable, bandwidth-efficient architecture»makes good use of modern CMOS ICs - single chip can
provide 100 Gb/s thruput and buffer thousands of cells
Load Balancing
Stage
Input Line
Cards
Output Line
Cards
Shared Memory Switch
Elements
3 - Jonathan Turner
Resequencing with Sequence Numbers
Drawbacks»each output needs N resequencing arrays» initialization when line card comes on-line»timeouts needed to cope with lost cells»multicast requires per flow resequencing arrays
Sequence Numbers
AddedReseq. Arrays
indexed by seq. #
4 - Jonathan Turner
Time-Based Resequencing
Single resequencing buffer per output. Cells held until “age” exceeds threshold (T ). Options for late cells.
»discard (strict resequencing) or buffer (loose)
Timestamps Added
Timestamp Ordered
Reseq. Buffer
Output Buffer
5 - Jonathan Turner
Henrion’s Strict Resequencer Design
Implemented using linked lists in common memory. Constant time per cell.
arriving cell ts“ready” cells
T slot timing wheel
nowinsert at timestamp
mod T
current time modulo T
6 - Jonathan Turner
now
Implementing Loose Resequencing
Only approximates loose resequencing.» must still discard “really late” cells.» low cost of large timing wheel allows good approximation
lagnow
T
arriving cell ts1 ts1+T
late arriving cell ts2 ts2+T
“Normal”
insertion range
“ready”cells
“waiting”
cells
next next
lag
new cell inserted at
timestamp+T
Cannot just insert late cells into output list.
7 - Jonathan Turner
Fast-Forwarding the Lag Pointer Lag pointer must advance to next non-empty list
on every clock tick for constant time operation.»no time to check successive pointers in timing wheel»use fast-forward bits to speed-up process
0101000101001101000000000000010000000000000000000011000101001101
11010011
summary word
bit for current
time
next non-empty timing
wheel slot
Two memory reads suffice to find next slot.» 32 bit words allows range of 1024» 128 bit words allows range of 16,384
8 - Jonathan Turner
Synchronization Time-based resequencing requires
synchronization of all line cards.» in small routers, requires just a common backplane
signal» in large routers, line cards connected to network only
by optical data cables Requires low-level clock synchronization protocol.
»“master” line card issues periodic broadcast synchronization messages
»network forwards sync messages with constant delay only approximate synchronization is necessary
»new clock master selected on failure Independent line card clocks require adjustments.
»suspend transmission when delaying clock
9 - Jonathan Turner
1.E-05
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
0.95 0.96 0.97 0.98 0.99 1.00Input Load
Lat
e Pro
babi
lity
simple random traffic
fixed threshold, T =128
speedup=1.01
1.03
1.05
Performance of Strict Resequencing
1.E-05
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
1.01 1.03 1.05 1.07 1.09Speedup
Lat
e Pro
babi
lity
simple random traffic
fixed threshold
input load = 1.0
T =64
128
256
Simple random traffic. 3 stage network, 8 port SEs, 512 (shared) cell buffers. 1st stage SEs use round robin load balancing for each input.
Late cells rare with
small speedup
For systems with 10G links,
delay for 256 cells is 10 s.
10 - Jonathan Turner
0
50
100
150
200
250
300
350
400
450
250 500 750 1000 1250 1500 1750 2000
Time
resequencer occupancy
age of oldest cell
network delay
poor performance
overload
agethreshold
fixed, strict, T=128
Performance on Adversarial Traffic
Growing network delay
Cells discarded when delay exceeds T
Delay drops as SE buffers
drain
Resequencer recovers when
delay drops below T
2:1 overload at
“target” output
11 - Jonathan Turner
Performance on Bursty Traffic
1.E-05
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
1 2 3 4 5 6 7 8 9 10Mean Dwell Time
Lat
e Pr
obab
ility speedup=1.1
1.31.5
bursty traffic
strict, fixed, T =128
100% input load. Input picks an output at random – overload at 1 in 4 outputs Stays with target output for geometrically distributed time.
Poor performance even for large
speedups.
12 - Jonathan Turner
0
50
100
150
200
250
300
350
400
450
250 500 750 1000 1250 1500 1750 2000
Time
agethreshold
overload
age of oldest cell
networkdelay
resequenceroccupancy
fixed, loose, T=128
Loose Reseq. with Adversarial TrafficArriving cells are
younger than oldest waiting cells, so no resequencing errors
13 - Jonathan Turner
1.E-05
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
0 5 10 15 20 25 30Mean Dwell Time
Lat
e Pro
babi
lity
speedup=1.1
1.3
1.5
loose,fixed, T =128
bursty traffic
Loose Reseq. with Bursty Traffic
Tolerates about 3x longer bursts than strict reseq.
14 - Jonathan Turner
Adaptive Resequencing Adjust age threshold to match observed delay.
»parameters: window size (W), short term delay difference bound ()
»variables: max delay in current measurement window (d0)and previous measurement window (d-1)
»age threshold = + max{d0, d-1}
» implement by extending loose, fixed threshold design Theorem. (simplified). If cell c1 enters
interconnection network just before c2 and exits no later than after c2, then adaptive resequencer with W forwards c1 before c2.
Resequencing errors caused by excessive delay variability, rather than large delays.
15 - Jonathan Turner
0
50
100
150
200
250
300
350
400
450
250 500 750 1000 1250 1500 1750 2000
Time
age threshold
networkdelay
age of oldest cell
resequenceroccupancy
overload adaptive, W==32
Performance on Adversarial Traffic
Arriving cells are younger than oldest waiting cells, so no resequencing errors
Resequencer adds small
increment to network delay.
Age threshold tracks network delay.
Resequencer occupancy stays
bounded.
16 - Jonathan Turner
Performance on Bursty Traffic
1.E-05
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
0 10 20 30 40 50 60Mean Dwell Time
Lat
e Pro
babi
lity
bursty traffic, adaptive, =64
speedup=1.1
1.3
1.5
late+input loss
Tolerates about 2x longer bursts than loose reseq.
Less sensitive to extremely large bursts
Performance degrades when
switch buffers fill. Caused by delay variation in first
stage.
17 - Jonathan Turner
Boosting Performance for Long Bursts
1.E-07
1.E-06
1.E-05
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
20 40 60 80 100 120 140 160 180 200
Lat
e Pro
babi
lity
first stage buffer capacity
256128
32
64
bursty traffic, dwell=100, speedup=1.2
adaptive
resequencer
512
Limiting first stage buffering cuts variability.
Small first stage buffer gives good performance with
modest “extra” delay
Smallest buffer reduces switch throughput by
2%.
18 - Jonathan Turner
Speeding up Threshold Reductions Downward age threshold adjustments are delayed
by window mechanism. Can reduce delay using “finer-grained” windows.
»max delay d0,d-1,d-2, . . . in k measurement windows
»age threshold = + max{d0, d-1,d-2, . . .}
Theorem. (simplified). If cell c1 enters interconnection network just before c2 and exits no later than after c2, then adaptive resequencer with (k1)W forwards c1 before c2.
For large k, max delay in threshold adjustment is cut almost in half.
19 - Jonathan Turner
Closing Remarks Adaptive resequencing can virtually eliminate
resequencing errors in multistage networks.»to handle most extreme traffic, need to limit delay
variability in load balancing stages» in systems that regulate the overall flow of traffic,
extreme cases should not arise at all Henrion’s strict resequencer can be modified for
adaptive resequencing – O(1) time per cell. More sophisticated interconnection network can
increase throughput and reduce delay variability.»per destination queues with controlled buffer sharing»per destination flow control and load balancing»timestamp-ordered switch element queues