The Need for an Improved PAUSE. Mitch Gusat and Cyriel Minkenberg, IBM Zurich Research Lab GmbH. IEEE 802, Dallas, Nov. 2006


IBM Zurich Research Lab GmbH

The Need for an Improved PAUSE

Mitch Gusat and Cyriel Minkenberg

IEEE 802, Dallas, Nov. 2006


Outline

I) Overcoming PAUSE-induced Deadlocks
– PAUSE exposed to circular dependencies
– Two deadlock-free PAUSE solutions

II) PAUSE Interaction with Congestion Management

III) Conclusions


PAUSE Issues

• PAUSE-related issues interfere with BCN simulations

• Correctness: deadlocks
– cycles in the routing graph (if multipath adaptivity is enabled): multiple solutions exist
– circular dependencies (in bidirectional fabrics): BCN can't help with this => solutions required

• Performance (to be elaborated in a future report): low-order HOL-blocking and memory hogging
– Non-selective PAUSE causes hogging, i.e., monopolization of common resources: e.g., shared memory may be monopolized by frames for a congested port (as shown in the slide's figure)
– Consequences: at best, reduced throughput; at worst, unfairness, starvation, saturation tree, collapse
– Properly tuned, BCN can address this problem


A 3-level Bidir Fat Tree Unfolded

• Using shared-memory switches with global PAUSE in a bidirectional fat tree network can cause deadlock
– Circular dependencies (CD) != loops in the routing graph (STD)
– Deadlocks were observed in BCN simulations

Root = ‘hinge’ to unfold


PAUSE-caused Deadlocks in BCN Simulations

• 16-node 5-stage fabric
• Bernoulli traffic

[Figure: four simulation panels: SM, no BCN; SM, BCN; Partitioned, w/ BCN; Partitioned, no BCN]


The Mechanism of PAUSE-induced CD Deadlocks

• When incorrectly implemented, PAUSE-based flow control can cause hogging and deadlocks

• PAUSE deadlocking in shared-memory switches:
– Switches A and B are both full (within the granularity of an MTU or jumbo frame) => PAUSE thresholds exceeded
– All traffic from A is destined to B and vice versa
– Neither can send; each waits on the other indefinitely: deadlock
– Note: due to shortest-path routing, traffic from A never takes the path from B back to A, and vice versa

[Figure: switches A and B]


Two Solutions to Defeat the Deadlock

• I. Architectural: assert PAUSE on a per-input basis
– No input is allowed to consume more than 1/N-th of the shared memory
– All traffic in B's input buffer for A is guaranteed to be destined to a different port than the one leading back to A (and vice versa)
– Hence, the circular dependency has been broken!
– Confirmed by simulations
– Assert PAUSE on input i: occmem >= Th or occ[i] >= Th/N
– Deassert PAUSE on input i: occmem < Th and occ[i] < Tl/N
– Qeq = M / (2N)

• II. LL-FC: bypass queue, distinctly PAUSE-d
– Achieves a similar result as (I), plus:
– independent of switch architecture (and implementation)
– required for IPC traffic (LD/ST, request/reply)
– compatible w/ PCIe (device driver compatibility)

[Figure: switches A and B]
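The per-input assert/deassert rule of solution I can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the frame-count units and the re-evaluation of every input after each occupancy change are hypothetical choices, not taken from the slides; only the two threshold conditions (Th, Tl, N, occmem, occ[i]) come from the slide.

```python
class PerInputPause:
    """Illustrative per-input PAUSE state for one shared-memory switch."""

    def __init__(self, n_inputs, th, tl):
        self.n_inputs = n_inputs          # N: number of input ports
        self.th = th                      # Th: high threshold (frames)
        self.tl = tl                      # Tl: low threshold (frames)
        self.occ = [0] * n_inputs         # occ[i]: frames buffered per input
        self.paused = [False] * n_inputs  # PAUSE currently asserted on input i?

    def update(self, i, delta):
        """Change the occupancy of input i by delta frames, then re-evaluate
        the slide's conditions for every input (assumption: all inputs are
        re-checked on each change). Returns a copy of the PAUSE flags."""
        self.occ[i] += delta
        occmem = sum(self.occ)            # total shared-memory occupancy
        for k in range(self.n_inputs):
            if occmem >= self.th or self.occ[k] >= self.th / self.n_inputs:
                self.paused[k] = True     # assert PAUSE on input k
            elif occmem < self.th and self.occ[k] < self.tl / self.n_inputs:
                self.paused[k] = False    # deassert PAUSE on input k
            # otherwise keep the previous state (hysteresis between Tl/N and Th/N)
        return list(self.paused)


if __name__ == "__main__":
    sw = PerInputPause(n_inputs=4, th=64, tl=32)
    print(sw.update(0, 16))   # input 0 reaches Th/N = 16 frames
    print(sw.update(0, -4))   # 12 frames: between Tl/N and Th/N, state is kept
    print(sw.update(0, -5))   # 7 frames: below Tl/N = 8, PAUSE is released
```

Because an input is paused as soon as it holds Th/N frames, no single input can fill the whole shared memory, which is the property the slide relies on to break the circular dependency.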


Simulation of BCN with Deadlock-free PAUSE

• Observations
– Qeq should be set to partition the shared memory: setting it higher promotes hogging; setting it lower wastes memory space
– BCN works best with large buffers per port: the buffer size per port should be significantly larger than the mean burst size
– 256 frames per port


PAUSE Interaction with Congestion Management

• What is the effect of deadlock-free PAUSE on BCN?

• Memory partitioning ‘stiffens’ the feedback loop

• PAUSE triggers the backpressure tree earlier
– The backrolling propagation speed depends not only on the available memory, but also on the switch service discipline

• Next: static analysis of PAUSE-BCN interference as a function of the switch service discipline

Note: To visualise the analytical iterations, enable animation.


Simple Analytical Method

• Method used in this presentation
– explicit assumptions
– simple traffic scenario
– reduced MIN topology, with static/deterministic (fixed) routing

• This 'model' considers
– queuing: in the Ethernet Channel Adapter (ECA) and the switch element (SE)
– scheduling: in ECA and SE
– Ethernet's per-priority PAUSE-based LL-FC (aka backpressure, BP)
– reactive CM a la BCN

• Linearization around steady state => tractable static analysis
– salient transients will be mentioned, but not computed
– Compute the cumulative effects of: scheduling; LL-FC backpressure per priority (only one used here); CM source throttling (rate adjustment)
– Do not compute the formulas for: blocking probability per stage and SE; variance of the service time distribution; Lyapunov stability


Model and Traffic assumptions

• Traffic = ∑(background + hot)
“A total of 50% of link rate is attempted from 9 queues (8 background + 1 hot) from each ECA.”

• Bgnd traffic: 8 queues/ECA on the left. Each of the 8 queues is connected to one of the 8 ECAs on the right.
=> 64 flows (8 queues/ECA x 8) on the left that are each injecting packets.
“80% of these [total link rate] are background, that is 80% x 50% = 40% of link rate.”
=> background traffic intensity λ = 0.4 is uniformly space-distributed

• Hot traffic: “20% of these are hot, so hot traffic is 20% x 50% = 10% of link rate.”

[Figure: network topology annotated with per-link loads: .4 background on every link, plus hot-traffic increments of +.1 (8 links), +.2 (4 links), +.4 (2 links) and +.8 (1 link)]


120% Link Load => 20% Overload - What Happens Next?

• Hotspot arrival intensity: λbgnd + λhot = .4 + .8 = 1.2 > 1
=> Overload, [mild] congestion factor = 1.2 @ SE(L2,S3)

...next?

• BP and CM will react: if SE(L2,S3) is work-conserving, the 0.2 overload must be losslessly squelched by CM and BP

• The exact sequence depends on the actual traffic, SE architecture and threshold settings
– irrelevant for static analysis, albeit important in operation

• Separation of concerns -> study the independent effects of BP (1st) and CM (2nd)
– iff the system is linear in steady state -> superposition allows composing the effects
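The load arithmetic behind the 1.2 congestion factor can be reproduced with a short Python sketch. It assumes that the hot flows of the 8 ECAs merge pairwise at each stage, which is how the figure labels on the previous slide (+.1, +.2, +.4, +.8) read; the function and constant names are hypothetical.

```python
LINK_ATTEMPT = 0.5   # each ECA attempts 50% of link rate
BGND_SHARE = 0.8     # 80% of the attempted load is background
HOT_SHARE = 0.2      # 20% of the attempted load is hot
N_ECAS = 8           # hot sources converging on the hotspot

bgnd = BGND_SHARE * LINK_ATTEMPT        # 0.4 of link rate on every link
hot_per_eca = HOT_SHARE * LINK_ATTEMPT  # 0.1 of link rate per ECA


def hot_load_per_stage(n_sources, per_source):
    """Hot load on one link after 0, 1, 2, ... pairwise merges
    (assumption: hot flows merge 2:1 at every stage)."""
    loads = []
    flows = 1
    while flows <= n_sources:
        loads.append(flows * per_source)
        flows *= 2
    return loads


def congestion_factor(bgnd_load, hot_load, capacity=1.0):
    """Offered load divided by link capacity at the hotspot."""
    return (bgnd_load + hot_load) / capacity


if __name__ == "__main__":
    stages = hot_load_per_stage(N_ECAS, hot_per_eca)
    print(stages)                               # hot load: one flow up to all eight
    print(congestion_factor(bgnd, stages[-1]))  # about 1.2, i.e. 20% overload
```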

[Figure: the same topology with stage labels S1 to S3 and level labels L1 to L4; per-link loads of .4 background plus hot-traffic increments (+.1, +.2, +.4, +.6); cf = 1.2 at the hotspot; CM and BP annotations]


Link-Level FC will Back-Pressure: Whom? How Much? Who 1st?

• Depends on the SE's service discipline

• Most well-understood and used disciplines
1. Round-Robin (RR); versions: strict (non-WC) and work-conserving (skips invalid queues)
2. FIFO, aka FCFS, aka EDF (timestamps, aging)
3. Fair Queuing, WRR, WFQ

• A future 802.3x should standardize only the LL-FC, not its 'fairness'

[Figure: hotspot detail. Labelled loads: bgnd + hot' = .4 + .4 = .8 and hot'' = .4; .8 + .4 = 1.2 > 1, so buffers fill up. "Stop1 ?" and "Stop2 ?" mark the candidate links to pause.]


EDF-based BP: FCFS-type of Fairness (subset of max-min)

• New EDF-fair TX rates are backpropagated
λ' = (1 - θ) * λ = 0.834 * λ
θ = 1 - μj / (∑i λij), computed by incremental upstream traversal rooted at SE(L2,S3)

Hint: subtract the bgnd traffic λ = .4 from the EDF-fair rates and compare w/ the previous hot rates

Obs.: Under moderate-to-severe congestion, θ -> 1 => λ' -> 0: blocking spreads across all ingress branches => neither parking-lot 'unfairness' nor flow decoupling is possible (wide-canopy saturation tree).

* All flows sharing resources along the hot paths are backpressured in proportion to their respective contribution (not their traffic class). No flow isolation.
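The EDF-fair scaling can be checked numerically for a single SE. The sketch below applies the slide's two formulas to one overloaded output with service rate μ; it does not model the upstream traversal, and the function name and the no-overload guard are assumptions added for illustration.

```python
def edf_backpressure(arrivals, mu=1.0):
    """Scale all input rates of one SE by the same factor (1 - theta),
    with theta = 1 - mu / sum(arrivals), as on the slide.

    arrivals: offered rates lambda_ij on the inputs feeding output j
    mu:       service rate mu_j of that output (1.0 = full link rate)
    Returns (theta, new_rates).
    """
    total = sum(arrivals)
    if total <= mu:                 # no overload: nothing to backpressure
        return 0.0, list(arrivals)
    theta = 1.0 - mu / total
    return theta, [(1.0 - theta) * lam for lam in arrivals]


if __name__ == "__main__":
    # Hotspot inputs from the slides: .8 (bgnd + hot') and .4 (hot'')
    theta, rates = edf_backpressure([0.8, 0.4])
    print(theta)   # about 0.167, so 1 - theta is about 0.833 (quoted as 0.834)
    print(rates)   # both inputs are reduced by the same factor
```

Every input is cut by the same factor regardless of how much it contributes, which is why this discipline offers no isolation between flows.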

[Figure: the topology after EDF-based BP, with stage labels S1 to S3 and level labels L1 to L4; original loads (.4 background, hot increments +.1, +.2, +.4) and backpressured link rates 1.0, .734, .666, .566, .5, .483, .417 and .4]


RR-based BP: Prop. Fairness – Selective and Drastic

• New RR-fair TX rates are iteratively computed and backpropagated
1. identify the inputs (INs) exceeding their RR quota, as members of N' ≤ N
2. distribute the overload δ across N': δij' = (N*λij - μj) / (N*N'), with δij' ≤ δ for work-conserving service
3. recompute the new admissible arrival rates: λij' = λij - δij', by incremental upstream traversal rooted at SE(L2,S3)
3'. With strict RR, δij' ≤ δ no longer holds => the BP effects are drastic and focused!

Hint: subtract the bgnd traffic λ = .4 from the RR-fair rates and compare w/ the previous hot rates

Obs. 1: Only the selected branch is BP-ed (discrimination) => RR-BP blocking always discriminates between ingress branches.

Obs. 2: Under severe congestion and/or many hops, the selected branches will be swiftly choked down (bonsai: narrow trees).
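The three RR steps can be sketched for a single SE as follows. This is one reading of the slide, not a definitive implementation: it assumes the per-input cut is (N*λij - μj) / (N*N'), that the work-conserving variant caps each cut at the total overload δ, and that the RR quota is μj/N; the function name is hypothetical and the upstream traversal is not modeled.

```python
def rr_backpressure(arrivals, mu=1.0, work_conserving=True):
    """Return the RR-fair admissible rates for one SE output.

    arrivals:        offered rates lambda_ij on the N inputs
    mu:              service rate mu_j of the output
    work_conserving: if True, cap each cut at the total overload delta
    """
    n = len(arrivals)
    overload = sum(arrivals) - mu            # delta
    if overload <= 0:
        return list(arrivals)                # no overload, no backpressure
    quota = mu / n                           # per-input RR share (assumption)
    # Step 1: inputs exceeding their quota form the set N'
    over = [i for i, lam in enumerate(arrivals) if lam > quota]
    new_rates = list(arrivals)
    for i in over:
        # Step 2: cut for input i, as read from the slide's formula
        cut = (n * arrivals[i] - mu) / (n * len(over))
        if work_conserving:
            cut = min(cut, overload)
        # Step 3: new admissible arrival rate
        new_rates[i] = arrivals[i] - cut
    return new_rates


if __name__ == "__main__":
    hotspot = [0.8, 0.4]   # bgnd + hot' and hot'' from the slides
    print(rr_backpressure(hotspot, work_conserving=True))   # first input cut to about .6
    print(rr_backpressure(hotspot, work_conserving=False))  # first input cut to about .5
```

Only the input above its quota is throttled, while the other keeps its full rate: the selective behaviour the observations describe.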

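[Figure: the topology after RR-based BP, with stage labels S1 to S3 and level labels L1 to L4; original loads (.4 background, hot increments +.1, +.2, +.4) and backpressured link rates 1.0, .8, .6, .6, .5, .4, .4 and .3, with alternative values / .5, / .25 and / .15]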


20% Overload - Reaction According to CM

• What’s the effect of CM only, if no LL-FC BP?

• Congestion factor cf = 1.2:

1. Marking by SE(L2,S3)
– is done at flow resolution (queue connection here)
– is based on SE queue occupancy and a set of thresholds (a single one here, @8)
– if fair w/ p = 1%, BCN marking is pro-rated: 33% (bgnd) + 67% (hot)

2. ECA sources adapt their injection rate per e2e flow

• Desired result: convergence to proportionally fair, stable rates
λbgnd + λCM_hot = O(.33 + .67), achievable by fair marking by CPID, proper tuning of BCN params and enhancements to self-increase (see recent Stanford U. proposal)
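The 33% / 67% split follows directly from the offered loads at the hotspot. The sketch below computes only the pro-rated shares and the steady-state target rates; it is not a model of the BCN control loop, and the function name and dictionary keys are hypothetical.

```python
def proportional_targets(rates, capacity=1.0):
    """Pro-rate marking shares and steady-state targets by offered load.

    rates:    offered load per traffic aggregate at the congested SE
    capacity: service rate of the congested output (1.0 = link rate)
    Returns (shares, targets), both dicts keyed like `rates`.
    """
    total = sum(rates.values())
    shares = {name: load / total for name, load in rates.items()}
    if total <= capacity:
        targets = dict(rates)        # no overload: rates are left as offered
    else:
        targets = {name: capacity * s for name, s in shares.items()}
    return shares, targets


if __name__ == "__main__":
    shares, targets = proportional_targets({"bgnd": 0.4, "hot": 0.8})
    print(shares)    # about 1/3 of the marks for bgnd, 2/3 for hot
    print(targets)   # proportionally fair rates summing to the link rate
```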


20% Overload - Reaction According to LL-FC

Strictly depending on the service discipline; 802 shouldn't mandate scheduling to switch vendors, because:

• Round-Robin (RR: strict or work-conserving)
– strong/prop. fairness
– decouples flows
– simple & scalable
– globally unfair (parking lot problem)

• FIFO/EDF (timestamps)
– temporally & globally fair: first-come-first-served
– locally unfair => flow coupling (can't isolate across partitions and clients)
– complex to scale

• BP will impact the speed, strength and locality (fairness) of backpressure (underlying CM), hence the different behaviors of the CM loop


Observations

• PAUSE-induced deadlocks must be solved
– two solutions were proposed

• PAUSE + BCN: two intercoupled feedback loops
– BP/LL-FC modulates CM's convergence: +/- phase and amplitude depend on topology, RTTs, traffic and SE

• Switch service disciplines impact (via PAUSE) BCN's stability margin and transient response
– Switches w/ RR service may require higher gains for w and Gd, or a higher Ps, than switches using EDF
– ...how to signal this?

• CM should trigger earlier than BP => the two mechanisms, albeit 'independent', should be co-designed and co-tuned
– the choice of thresholds depends on link and e2e RTTs


Instead of Conclusion: Improved PAUSE

• 10GigE is a discontinuity in the Ethernet evolution
– opportunity to address new needs and markets
– however, improvements are needed

• Requirements of next-generation PAUSE
1. Correct by design, not implementation
– Deadlock-free
– No HOL1-blocking and, possibly, reduced HOL2-blocking
– Note: do not try to address high-order HOL-blocking at the link layer
2. Configurable for both lossy and lossless operation
3. QoS / 802.1p support
4. Enables virtualization / 802.1q
5. Beneficial or neutral to CM schemes (BCN, TCP, ...)
6. Legacy PAUSE-compatible
7. Simple to understand and implement by designers
– Min. no. of flow control domains: h/w queues and IDs in Ether-frame
8. Compelling to use => always enabled...!