R2D2: Reliable and Rapid Data Delivery for DCs
Berk Atikoglu, Mohammad Alizadeh, Tom Yue, Balaji Prabhakar, Mendel Rosenblum
Motivation
Unreliable packet delivery is due to:
▪ Corruption: dealt with via retransmission
▪ Congestion: particularly bad due to incast (fan-in) congestion
These losses increase the difficulty of reliable transmission: throughput drops and flow transfer times grow.
Incast
1. The client sends a request to several servers.
2. The responses travel to the switch simultaneously.
3. The switch buffer overflows from the volume of data; some packets are dropped.
[Diagram: a client (C) requests data from several servers (S); the simultaneous responses converge at the switch, whose buffer overflows and drops packets.]
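A back-of-the-envelope sketch of why the overflow happens. The 32 KB response size and 128 KB shared buffer below are illustrative assumptions, not measured values from the talk:

```python
# Illustrative incast arithmetic: N synchronized responses vs. a shallow
# shared switch buffer. Response and buffer sizes are assumed, not measured.
def incast_overflow(n_servers, response_bytes, buffer_bytes):
    """Return the number of bytes that cannot be buffered when
    n_servers responses arrive at the switch simultaneously."""
    burst = n_servers * response_bytes
    return max(0, burst - buffer_bytes)

# 32 servers each answering with 32 KB into a 128 KB shared buffer:
excess = incast_overflow(32, 32 * 1024, 128 * 1024)
print(excess)  # 917504 bytes have nowhere to go and are dropped
```

With a single server the burst fits and nothing is dropped; the loss appears only as fan-in grows.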
Existing Approaches
High-resolution timers
▪ Reduce retransmission timeouts (RTO) to hundreds of µs
▪ Proposed in Vasudevan et al. (SIGCOMM 2009); see also Chen et al. (WREN 2009)
▪ Spend a large number of CPU cycles on rapid interrupts or timer programming
▪ In virtualized environments, the high cost of processing hardware interrupts means even higher overhead
Large switch buffers
▪ Reduce incast occurrences by buffering enough packets
▪ Increased packet latency
▪ Complex implementation
▪ Large caches are expensive
▪ Increased power usage
Our Approach: R2D2
R2D2 collapses all flows into a single “meta-flow”:
▪ A single wait queue holds packets sent by the host that are not yet ACKed
▪ A single retransmission timer; no per-flow state
▪ Provides reliable packet delivery
▪ Resides in Layer 2.5, a shim layer between Layer 2 and Layer 3
Key observation: exploit the uniformity of data center environments
▪ Path lengths between hosts are small (3 – 5 hops)
▪ RTTs are small (100 – 400 µs)
▪ Path bandwidths are uniformly high (1 Gbps, 10 Gbps)
▪ Therefore, the amount of data from a 1G/10G source “in flight” is less than 64/640 KB
Store source packets in R2D2 on the fly; rapidly retransmit dropped or corrupted packets.
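The in-flight bound follows from the bandwidth-delay product; a quick check using the worst-case RTT quoted above:

```python
# Bandwidth-delay product: the most data a source can have "in flight".
def in_flight_bytes(bandwidth_bps, rtt_seconds):
    return bandwidth_bps * rtt_seconds / 8  # bits -> bytes

# 1 Gbps source at the worst-case 400 microsecond RTT:
print(in_flight_bytes(1e9, 400e-6))   # 50000.0 bytes, under the 64 KB bound
# 10 Gbps source at the same RTT:
print(in_flight_bytes(10e9, 400e-6))  # 500000.0 bytes, under the 640 KB bound
```

This is why a small per-host wait queue is enough to cover every packet that could still be lost in the network.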
[Diagrams: protocol stacks before and after. Without R2D2, TCP at Layer 3 sits directly on Layer 2 at both hosts. With R2D2, a Layer 2.5 shim holding R2D2 sits between Layer 3 and Layer 2 on the sender.]
When a flow times out:
▪ Retransmit the first un-ACKed packet (fill the hole).
▪ Back off: double the flow’s timeout value.
When an ACK comes in:
▪ Reset the timeout back-off.
1. An outbound packet is intercepted by R2D2.
2. A timer is started.
3. A copy of the packet is placed in the wait queue.
4. The returning TCP ACK removes all ACKed packets held in the wait queue.
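The packet lifecycle and the timeout back-off can be sketched as a single meta-flow wait queue. This is a simplified model of the mechanism as described, not the actual kernel code; the timer is reduced to a manual `on_timeout` call and sequence numbers stand in for real TCP state:

```python
from collections import deque

class MetaFlow:
    """Single wait queue + single retransmission timer for all flows (sketch)."""
    def __init__(self, min_timeout=0.003):      # 3 ms minimum timeout
        self.wait_queue = deque()               # copies of un-ACKed packets
        self.min_timeout = min_timeout
        self.timeout = min_timeout

    def on_send(self, seq, packet):
        # Steps 1-3: intercept the outbound packet, (re)arm the timer,
        # and keep a copy in the wait queue until it is ACKed.
        self.wait_queue.append((seq, packet))

    def on_ack(self, acked_seq):
        # Step 4: a returning TCP ACK removes every packet it covers,
        # and resets the timeout back-off.
        while self.wait_queue and self.wait_queue[0][0] <= acked_seq:
            self.wait_queue.popleft()
        self.timeout = self.min_timeout

    def on_timeout(self):
        # Retransmit the first un-ACKed packet ("fill the hole"),
        # then double the timeout (exponential back-off).
        self.timeout *= 2
        return self.wait_queue[0] if self.wait_queue else None
```

Because there is one queue and one timer, no per-flow state needs to be allocated or tracked.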
Features
▪ Reliable, but not guaranteed, delivery: a maximum number of retransmissions before giving up
▪ State sharing: only one wait queue; all packets go in the same queue
▪ No change to the network stack: kernel module in Linux; driver in Windows; the hardware version is OS-independent
▪ Incremental deployability: it is possible to protect only a subset of flows
Implementation
▪ Implemented as a Linux kernel module on kernel 2.6.*: no need to modify the kernel; can be loaded/unloaded easily
▪ Incoming/outgoing TCP/IP packets are captured using Netfilter
▪ Captured packets are put into a queue: only meta-data is kept in the queue; the packet itself is cloned
▪ An L2.5 thread processes the packets in the queue periodically
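The capture path is a producer/consumer pattern: the Netfilter hook enqueues per-packet metadata, and the L2.5 thread drains the queue on its own schedule. A minimal userspace sketch of that pattern (function names and metadata fields are illustrative, not the module's actual API):

```python
import queue

# The hook enqueues only per-packet metadata (the packet itself is
# cloned elsewhere); an L2.5 worker drains the queue periodically.
pending = queue.Queue()

def netfilter_hook(seq, length):
    """Stand-in for the Netfilter capture point: record metadata only."""
    pending.put({"seq": seq, "len": length})

def l25_process_once():
    """One pass of the periodic L2.5 thread: drain queued metadata."""
    processed = []
    while not pending.empty():
        processed.append(pending.get())
    return processed

netfilter_hook(1, 1500)
netfilter_hook(2, 1500)
print(len(l25_process_once()))  # 2
```

Keeping only metadata in the queue keeps the fast path cheap; the cloned packet is what actually sits in the wait queue for retransmission.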
Test Setup
48 Dell PowerEdge 2950 servers:
▪ Intel Core 2 Quad Q9550 × 2
▪ 16GB ECC DRAM
▪ Broadcom NetXtreme II 5708 1GbE NIC
▪ CentOS 5.3 Final; Linux 2.6.28-10
Switches:
▪ Netgear GS748TNA (48 ports, GbE)
▪ Cisco Catalyst 4948 (48 ports, GbE)
▪ BNT RackSwitch G8421 (24 ports, 10GbE)
[Diagram: 1 rack of 48 servers, connected at 1GbE / 10GbE.]
Algorithms
R2D2:
▪ Minimum timeout: 3ms
▪ Max retransmissions: 10
▪ Delayed ACK disabled
TCP:
▪ CUBIC congestion control
▪ minRTO: 200ms
▪ Segmentation offloading disabled
▪ TCP timestamps disabled
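Several of these TCP settings correspond to standard Linux knobs; a sketch of how they might be applied (the interface name `eth0` is an assumption, and Linux's minRTO already defaults to 200ms, so no change is needed for it):

```shell
# Select CUBIC congestion control
sysctl -w net.ipv4.tcp_congestion_control=cubic
# Disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0
# Disable segmentation offloading on the NIC (interface name assumed)
ethtool -K eth0 tso off
ethtool -K eth0 gso off
```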
Workload – 1 GbE switches
▪ Number of servers (N): 1, 2, 4, 8, 16, 32, 46
▪ File size (S): 1MB, 20MB
▪ The client requests (S/N) MB from each server and issues a new request when all servers respond
Measurements:
▪ Goodput
▪ Retransmission ratio = (retransmitted packets) / (total packets sent by TCP)
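The two metrics are straightforward to compute; a small sketch (the example numbers are hypothetical, not results from the evaluation):

```python
def goodput_mbps(file_bytes, transfer_seconds):
    """Application-level throughput: useful bytes delivered per unit time."""
    return file_bytes * 8 / transfer_seconds / 1e6

def retransmission_ratio(retransmitted, total_sent):
    """Fraction of packets sent by TCP that were retransmissions."""
    return retransmitted / total_sent

# A hypothetical 1 MB transfer completing in 10 ms:
print(goodput_mbps(1_000_000, 0.010))  # 800.0 Mbps
# 5 retransmissions out of 1000 packets sent:
print(retransmission_ratio(5, 1000))   # 0.005
```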
Netgear Test – Goodput
[Charts: goodput (Mbps) vs. number of servers (1 – 46) for R2D2 and TCP, for 1MB and 20MB file sizes.]
Netgear Test – Retransmission Ratio
[Charts: retransmission ratio vs. number of servers (1 – 46), for 1MB and 20MB file sizes.]
Netgear Test – Multiple Clients
6 clients (instead of 1 client) and 32 servers; each client requests a file from each of the 32 servers.
[Charts: goodput (Mbps) for R2D2 vs. TCP with 6 clients, for 1MB and 20MB file sizes.]
Catalyst 4948 Test – Goodput
[Charts: goodput (Mbps) vs. number of servers (1 – 46) for R2D2 and TCP, for 1MB and 20MB file sizes.]
Catalyst 4948 Test – Retransmission Ratio
[Charts: retransmission ratio vs. number of servers (1 – 46), for 1MB and 20MB file sizes.]
Catalyst 4948 Test – Multiple Clients
[Charts: goodput (Mbps) for R2D2 vs. TCP across the six client tests, for 1MB and 20MB file sizes.]
10GbE test – Goodput
File size: 10MB. Number of servers: 1, 5, 9, 13, 17, 21.
[Charts: goodput (Mbps) and retransmission ratio vs. number of servers for R2D2 and TCP.]
Conclusion
▪ R2D2 is scalable and fast, and provides reliable delivery
▪ No need to modify the kernel; the module can be loaded/unloaded easily
▪ Improves reliability in data center networks
▪ A hardware implementation in the NIC can be much faster and would work well with TCP offload options like segmentation and checksum offloading
▪ We are developing an FPGA implementation