yuanwei lu 1,2 , peng cheng 2, jiansong zhang , enhongchen ......yuanwei lu 1,2, guo chen3, bojie...

45
Yuanwei Lu 1,2 , Guo Chen 3 , Bojie Li 1,2 , Kun Tan 4 , Yongqiang Xiong 2 , Peng Cheng 2 , Jiansong Zhang 2 , Enhong Chen 1 , Thomas Moscibroda 5 1 USTC, 2 Microsoft Research, 3 Hunan University, 4 Huawei Technologies, 5 Microsoft Azure

Upload: others

Post on 10-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Yuanwei Lu1,2, Guo Chen3, Bojie Li1,2, Kun Tan4, Yongqiang Xiong2, Peng Cheng2, Jiansong Zhang2, Enhong Chen1, Thomas Moscibroda5

1USTC, 2Microsoft Research, 3Hunan University, 4Huawei Technologies, 5Microsoft Azure

• Low and stable base latency: <10μs

• High throughput with low CPU overhead

• e.g. Half of the traffic in a online service datacenter is RDMA [1]

• Lots of parellel paths between any pair of nodes

• Hot spot caused high latency and low throughput• Not robust to failure

[1] Guo, Chuanxiong, et al. "RDMA over Commodity Ethernet at Scale. "Proceedings of the 2016 conference on ACM SIGCOMM 2016 Conference. ACM, 2016.

TCP/IP RDMA

Multi-path TCP

Multi-path Transport ?

[1] Kalia, Anuj, Michael Kaminsky, and David G. Andersen. "Design Guidelines for High Performance RDMA Systems." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.

MP-RDMA design needs to add as little hardware memory footprint as possible compared with existing single-path RDMA

#2: How to track out-of-order packets with small on-chip memory footprint?

#3: How to provide message ordering without hurting performance?

• Tracking the transmission and congestion states on every paths

Challenge #1: How to achieve congestion-aware traffic split without per-path states at NIC?

• Insight: window-based protocol with ACK clocking

cWnd 6Inflight 0aWnd 6

cWnd 6Inflight 6aWnd 0

cWnd 6Inflight 6aWnd 0

ECN ECN

cWnd 6Inflight 6aWnd 0ECN

ECN

1

2

cWnd 5.5Inflight 5aWnd 0.5ECN

ECN

1

2

cWnd 5.5Inflight 5aWnd 0.5ECN2

cWnd 5Inflight 4aWnd 1ECN2

cWnd 5Inflight 5aWnd 0

cWnd 5Inflight 3aWnd 2

cWnd 5Inflight 5aWnd 0

cWnd 5Inflight 5aWnd 0

#1: How to achieve congestion-aware traffic split without per-path states at NIC?

#2: How to track out-of-order packets with small on-chip memory footprint?

#3: How to provide message ordering without hurting performance?

• Need meta data to track out-of-order

• Large under certain scenarios • e.g. Link downgrade: 1.2KB bitmap • e.g. PFC pause: infinite

Challenge #2: How to track out-of-order packets with small on-chip memory footprint?

good pathsbad paths

∆ 3snd_ooh 0snd_ool 0 1 2

3 4

5 6

∆ 3snd_ooh 0snd_ool 0 1 2

3 4

5 6

∆ 3snd_ooh 4snd_ool 1 1 2

3 4

5 6

∆ 3snd_ooh 6snd_ool 3 1 2

5 6 7 8

∆ 3snd_ooh 6snd_ool 3

1 2

7 8

9 10

1 2

∆ 3snd_ooh 6snd_ool 3

1 2

7 8

9 10

1 2Drop!

#1: How to achieve congestion-aware traffic split without per-path states at NIC?

#2: How to track out-of-order packets with small on-chip memory footprint?

#3: How to provide message ordering without hurting performance?

• Problem: application may assume in-order messages delivery• Naïve solution: stop and wait • Throughput downgrade

WRITE 1: Data WRITE 2: meta data

Application poll on meta data

Message 1 Message 2

Challenge #3: How to provide message ordering without hurting the application performance?

• A synchronise message is guaranteed to arrive after all previous messages

• A synchronise message just wait ∆! before previous messages

• ∆!=α∗∆/(#$%&/'((), large enough to avoid out-of-order

• Altera Stratix V D5 FPGA board with two 40Gb Ethernet ports• 14 elements, ~2K lines of OpenCL code

• Dependency: 9 cycles • Clock Frequency: ~206MHz

• Dependency: 3 cycles • Clock Frequency: ~196MHz

[1] Li, Bojie, et al. "ClickNP: Highly flexible and High-performance Network Processing with Reconfigurable Hardware." ACM SIGCOMM 2016 Conference.

Conclusion: MP-RDMA can be implemented in hardware with high performance

• 47.05% higher application throughput v.s. DCQCN

• Under loss: > 3x throughput improvement

• No per-path states • Out-of-order aware path selection • Synchronise mechanism to ensure message ordering