Practical TDMA for Datacenter Ethernet
Bhanu C. Vattikonda, George Porter, Amin Vahdat, Alex C. Snoeren
Variety of applications hosted in datacenters
•Hadoop MapReduce: performance depends on throughput-sensitive traffic in the shuffle phase
•Memcached: generates latency-sensitive traffic
•All-to-all Gather/Scatter patterns
•The network is treated as a black-box
•As a result, applications like Hadoop MapReduce perform inefficiently, and applications like Memcached experience high latency
Why does the lack of coordination hurt performance?
Example datacenter scenario
[Figure: two bulk-transfer senders and one latency-sensitive sender transmit to a single traffic receiver]
Drops and queuing lead to poor performance
•Bulk transfer traffic experiences packet drops
•Latency sensitive traffic gets queued behind the bulk traffic in switch buffers
Current solutions do not take a holistic approach
•Facebook uses a custom UDP-based transport protocol
•Alternative transport protocols like DCTCP address TCP's shortcomings
•Infiniband and Myrinet offer boutique hardware solutions to these problems, but they are expensive
Since the demand can be anticipated, can we coordinate hosts?
Taking turns to transmit packets
[Figure: the senders take turns transmitting packets to the receiver]
Time Division Multiple Access (TDMA)
Enforcing TDMA is difficult
•It is not practical to task hosts with keeping track of time and controlling their own transmissions
•End host clocks quickly drift out of synchronization
Existing TDMA solutions need special support
•Since end host clocks cannot be synchronized, special support is needed from the network
•FTT-Ethernet, RTL-TEP, and TT-Ethernet require modified switching hardware
•Even with special support, hosts need to run real-time operating systems to enforce TDMA (e.g., FTT-Ethernet, RTL-TEP)
Can we do TDMA with commodity Ethernet?
TDMA using Pause Frames
•Flow control packets (pause frames) can be used to control Ethernet transmissions
•Pause frames are processed in NIC hardware, so flow control packets are processed very efficiently
Experiment: blast UDP packets from a sender, send 802.3x pause frames back to it, and measure the time the sender takes to react to the pause frames (frame construction sketched below)
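For concreteness, here is a minimal sketch of how an 802.3x pause frame can be constructed, using scapy and a placeholder interface name; it illustrates the standard frame format and is not the authors' measurement tool.

```python
# IEEE 802.3x PAUSE frame: destination is the reserved MAC control multicast
# address 01:80:C2:00:00:01, EtherType 0x8808, opcode 0x0001, followed by a
# 2-byte pause time measured in units of 512 bit times.
from scapy.all import Ether, Raw, sendp

def make_pause_frame(pause_quanta: int) -> Ether:
    """Ask the link peer to stop transmitting for pause_quanta * 512 bit
    times; a value of 0 un-pauses it immediately."""
    opcode = (0x0001).to_bytes(2, "big")       # MAC control opcode: PAUSE
    quanta = pause_quanta.to_bytes(2, "big")   # pause duration
    padding = bytes(42)                        # pad payload to the 46-byte minimum
    return Ether(dst="01:80:c2:00:00:01", type=0x8808) / Raw(opcode + quanta + padding)

# Pause the peer for the maximum duration, then release it ("eth0" is a placeholder).
sendp(make_pause_frame(0xFFFF), iface="eth0")
sendp(make_pause_frame(0x0000), iface="eth0")
```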
Results, measured using 802.3x pause frames:
•Reaction time to pause frames is 2–6 μs
•Low variance
TDMA using commodity hardware
TDMA is imposed over Ethernet using a centralized fabric manager, which:
•Collects demand information from the end hosts
•Computes the schedule for communication
•Controls end host transmissions
TDMA example
[Figure: fabric manager coordinating sender S and destinations D1, D2]
Demand collected from the end hosts:
•S -> D1: 1MB
•S -> D2: 1MB
Computed schedule:
•round1: S -> D1
•round2: S -> D2
•round3: S -> D1
•round4: S -> D2
•…
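The deck does not show how the fabric manager computes this schedule; below is a minimal round-robin sketch that reproduces the rounds above under an assumed slot capacity (the function name and the 512KB-per-slot figure are illustrative, not from the paper).

```python
from collections import deque

def round_robin_schedule(demands, slot_bytes):
    """Cycle through (source, destination) pairs with outstanding demand,
    assigning one pair per TDMA round until all demand is served.
    demands maps (src, dst) to bytes remaining."""
    pending = deque(sorted(demands.items()))
    schedule = []
    while pending:
        (src, dst), remaining = pending.popleft()
        schedule.append((src, dst))
        remaining -= slot_bytes                 # one slot's worth is sent
        if remaining > 0:
            pending.append(((src, dst), remaining))
    return schedule

# The example above: 1MB to each destination, assuming 512KB per slot.
demands = {("S", "D1"): 1_000_000, ("S", "D2"): 1_000_000}
for rnd, (src, dst) in enumerate(round_robin_schedule(demands, 512_000), 1):
    print(f"round{rnd}: {src} -> {dst}")        # round1: S -> D1, round2: S -> D2, ...
```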
More than one host
[Figure: fabric manager sending control packets to several hosts at once]
•Control packets should be processed with low variance
•Control packets should arrive at the end hosts synchronously
Synchronized arrival of control packets
•We cannot directly measure synchronous arrival
•Instead, we measure the difference in arrival times of a pair of control packets at 24 hosts
[Plot: difference in arrival of a pair of control packets at 24 hosts]
•Variation of ~15μs across different sending rates at the end hosts
Ideal scenario: control packets arrive synchronously
[Timeline: Hosts A and B both transition between Rounds 1, 2, and 3 at exactly the same instants]
Experiments show that packets do not arrive synchronously
[Timeline: Hosts A and B transition between rounds at slightly different instants, out of sync by <15μs]
Guard times to handle lack of synchronization
Guard times (15μs) handle out-of-sync control packets
[Timeline: Hosts A and B stop transmitting at the end of each round; the guard time absorbs the <15μs skew before the next round begins]
TDMA for Datacenter Ethernet
Control end host transmissions:
•Use flow control packets to achieve low variance
•Guard times adjust for variance in control packet arrival
Encoding scheduling information
•We use IEEE 802.1Qbb priority flow control (PFC) frames to encode scheduling information
•Using iptables rules, traffic for different destinations is classified into different Ethernet classes
•802.1Qbb PFC frames can then be used to selectively start transmission of packets to a particular destination (frame layout sketched below)
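As an illustration of why PFC enables per-destination control, here is a minimal sketch of an 802.1Qbb PFC frame, again using scapy. Unlike 802.3x, it carries a class-enable vector plus a per-class pause timer, so individual priority classes (one per destination under the iptables classification) can be paused or released independently; the class-to-destination mapping and interface name are assumptions.

```python
# IEEE 802.1Qbb PFC frame: same reserved MAC and EtherType as 802.3x, but
# opcode 0x0101, a 2-byte class-enable vector, and eight 2-byte pause timers
# (one per priority class, in units of 512 bit times).
from scapy.all import Ether, Raw, sendp

def make_pfc_frame(class_timers: dict) -> Ether:
    """class_timers maps a priority class (0-7) to a pause duration in
    quanta; 0 releases that class. Unlisted classes are left untouched."""
    enable_vector = 0
    timers = [0] * 8
    for cls, quanta in class_timers.items():
        enable_vector |= 1 << cls              # mark this class as affected
        timers[cls] = quanta
    payload = (0x0101).to_bytes(2, "big")      # MAC control opcode: PFC
    payload += enable_vector.to_bytes(2, "big")
    payload += b"".join(t.to_bytes(2, "big") for t in timers)
    payload += bytes(26)                       # pad payload to the 46-byte minimum
    return Ether(dst="01:80:c2:00:00:01", type=0x8808) / Raw(payload)

# Pause every class, then release only class 2 (the destination whose slot begins).
sendp(make_pfc_frame({c: 0xFFFF for c in range(8)}), iface="eth0")
sendp(make_pfc_frame({2: 0}), iface="eth0")
```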
Methodology to enforce TDMA slots
The fabric manager enforces each TDMA slot in three steps (see the sketch after this list):
•Pause all traffic
•Un-pause traffic to a particular destination
•Pause all traffic again to begin the guard time
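A minimal sketch of that slot loop, reusing make_pfc_frame from the PFC sketch above; the slot and guard values come from the experimental setup below, while the schedule format and class mapping are assumptions. Note that time.sleep is only illustrative here: at 300μs granularity a real implementation needs kernel bypass and busy-wait timing.

```python
import time
from scapy.all import sendp   # make_pfc_frame comes from the PFC sketch above

SLOT_US, GUARD_US = 300, 15                   # slot and guard times from the setup below
PAUSE_ALL = {c: 0xFFFF for c in range(8)}     # maximum pause for every priority class

def run_slots(schedule, dest_class, iface="eth0"):
    """schedule: destinations in slot order; dest_class: maps each
    destination to the Ethernet priority class its traffic was tagged with."""
    sendp(make_pfc_frame(PAUSE_ALL), iface=iface)                  # step 1: pause all traffic
    for dst in schedule:
        sendp(make_pfc_frame({dest_class[dst]: 0}), iface=iface)   # step 2: un-pause one destination
        time.sleep(SLOT_US / 1e6)                                  # the slot runs
        sendp(make_pfc_frame(PAUSE_ALL), iface=iface)              # step 3: re-pause; guard time begins
        time.sleep(GUARD_US / 1e6)
```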
Evaluation
•MapReduce shuffle phase: all-to-all transfer
•Memcached-like workloads: latency between nodes in a mixed environment, in the presence of background flows
•Hybrid electrical and optical switch architectures: performance in dynamic network topologies
Experimental setup
•24 HP DL380 servers, each with dual Myricom 10G NICs using kernel bypass to access packets
•1 Cisco Nexus 5000 series 10G 96-port switch, 1 Cisco Nexus 5000 series 10G 52-port switch
•300μs TDMA slot and 15μs guard time: an effective ~5% overhead (15 / (300 + 15) ≈ 4.8%)
All to all transfer in multi-hop topology
•10GB all to all transfer
•We use a simple round robin scheduler at each level
•5% inefficiency owing to guard time
[Figure: three groups of 8 hosts connected in a multi-hop topology; plot compares TCP and TDMA all-to-all completion against the ideal transfer time of 1024s]
Latency in the presence of background flows
[Figure: the earlier scenario, with two bulk-transfer senders and one latency-sensitive sender sharing a receiver]
•Start both bulk transfers
•Measure latency between nodes using UDP probes (a probe sketch follows)
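A minimal sketch of such a probe (a hypothetical round-trip measurement that assumes an echo server on the peer; not the authors' tool):

```python
import socket
import time

def udp_rtt_us(peer: str, port: int = 9000, timeout: float = 1.0) -> float:
    """Send one UDP probe to an echo server at peer:port and return the
    round-trip time in microseconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    start = time.perf_counter()
    sock.sendto(b"probe", (peer, port))
    sock.recvfrom(64)                          # block until the echo returns
    return (time.perf_counter() - start) * 1e6
```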
Results:
•Latency between the nodes in the presence of TCP flows is high and variable
•The TDMA system achieves lower latency
[Plot: latency under TCP, TDMA, and TDMA with kernel bypass]
Adapting to dynamic network configurations
•Link capacity between the hosts is varied between 10Gbps and 1Gbps every 10ms
[Plots: sender-to-receiver throughput over time under the varying link capacity, showing ideal performance and TCP performance]
•TDMA is better suited since it prevents packet losses
Conclusion
•TDMA can be achieved using commodity hardware by leveraging existing Ethernet standards
•TDMA can lead to performance gains in current networks: 15% shorter finish times for all-to-all transfers and 3x lower latency
•TDMA is well positioned for emerging network architectures that use dynamic topologies: 2.5x throughput improvement in dynamic network settings