detail - sigcomm...why the tail matters •recall: 100’sof data retrievals per page creation ......

33
DeTail Reducing the Tail of Flow Completion Times in Datacenter Networks David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz 1

Upload: others

Post on 08-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

DeTailReducing the Tail of Flow Completion Times

in Datacenter Networks

David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz

1

Page 2: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

A Typical Facebook Page

2

Modern pages have many components

Page 3: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Creating a Page

DatacenterNetwork

Internet

Front End

News Feed

Search

Ads

Chat

3

Page 4: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

What’s Required?

• Servers must perform 100’s of data retrievals*

– Many of which must be performed serially

• While meeting a deadline of 200-300ms**

– SLA measured at the 99.9th percentile**

• Only have 2-3ms per data retrieval

– Including communication and computation

*The Case for RAMClouds *SIGOPS’09+

**Better Never than Late *SIGCOMM’11+ 4

Page 5: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

What is the Network’s Role?

• Analyzed distribution of RTT measurements:

• Median RTT takes 334μs, but 6% take over 2ms

• Can be as high as 14ms

5

Source: Data Center TCP (DCTCP) *SIGCOMM’10+

Network delays alone can consume the data retrieval’s time budget

Page 6: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Why the Tail Matters

• Recall: 100’s of data retrievals per page creation

• The unlikely event of a data retrieval taking too long is likely to happen on every page creation

– Data retrieval dependencies can magnify impact

6

Page 7: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Impact on Page Creation

• Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation time)

As Facebook already at 130 data retrievals per page,need to address network delays

7

Page 8: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

App-Level Mitigation

• Use timeouts & retries for critical data retrievals

– Inefficient because of high network variance

– Choose from conservative timeouts and long delays or tight timeouts and increased server load

• Hide the problem from the user

– By caching and serving stale data

– Rendering pages incrementally

– User often notices, becomes annoyed / frustrated

Need to focus on the root cause8

Page 9: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Outline

• Causes of long data retrieval times

• Cutting the tail with DeTail

• Evaluation

9

Page 10: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Causes of Long Data Retrieval Times

• Data retrievals are short, highly variable flows

– Typically under 20KB in size, with many under 2KB*

• Short flows provide insufficient information for transport to agilely respond to packet drops

• Variable flow sizes decrease efficacy of network-layer load balancers

*Data Center TCP (DCTCP) *SIGCOMM’10+

10

Page 11: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Transport Layer Response

Timeout

Transport does not have sufficient information to respond agilely

11

Page 12: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Network Layer Load Balancers

• Expected to support single-path assumption

• Common approach: hash flows to paths

– Does not consider flow size or sending rate

• Results in uneven load spreading

– Leads hotspots and increased queuing delays

The single-path assumption restricts the abilityto agilely balance load

12

Page 13: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Recent Proposals

• Reduce packet drops

– By cross-flow learning [DCTCP] or explicit flow scheduling [D3]

– Maintain the single-path assumption

• Adaptively move traffic

– By creating subflows [MPTCP] or periodically remapping flows [Hedera]

– Not sufficiently agile to support short flows

13

Page 14: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Outline

• Causes of long data retrieval times

• Cutting the tail with DeTail

• Evaluation

14

Page 15: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

DeTail Stack

• Use in-network mechanisms to maximize agility

• Remove restrictions that hinder performance

• Well-suited for datacenters

– Single administrative domain

– Reduced backward compatibility requirements

15

Page 16: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Hop-by-hop Push-back

• Agile link-layer response to prevent packet drops

16

What about head-of-line blocking?

Page 17: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Adaptive Load Balancing

• Agile network-layer approach for balancing load

17

Synergistic relationship: local output queues indicatedownstream congestion because of push-back

Page 18: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Load Balancing Efficiently

• DC flows have varying timeliness requirements*

– How to efficiently consider packet priority?

• Compare queue occupancies for every decision

– How to efficiently compare many of them?

*Data Center TCP (DCTCP) *SIGCOMM’10+ 18

Page 19: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Priority in Load Balancing

Output Queue 1

High Priority

Low Priority

Output Queue 2Arriving Packet

Based on queueoccupancy

Ideal

How to enqueue packet so it is sent soonest?

19

Page 20: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Priority in Load Balancing

• Approach: track how many bytes to be sentbefore new packet

• Use per-priority counters

– Update on each packet enqueue/dequeue

– Compare counters to find least occupied port

20

Page 21: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Comparing Queue Occupancies

• Many counter comparisons required for every forwarding decision

• Want to efficiently pick the least occupied port

– Pre-computation is hard as solution is destination, time dependent

21

Page 22: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Use Per-Counter Thresholding

• Pick a good port, instead of the best one

Queues < TPacketPriority

1011

FavoredPorts

Dest. Address Forwarding Entry 0101

AcceptablePorts

& 0001

SelectedPort

22

Page 23: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Reorder-Resistant Transport

• Handle packet reordering due to load balancing

– Disable TCP’s fast recovery and fast retransmission

• Respond to congestion (no more packet drops)

– Monitor output queues and use ECN to throttle flows

23

Page 24: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

DeTail Stack

Application

Transport

Network

Link

Physical

Layer Component

Hop-by-hop Push-back

Adaptive Load Balancing

Reorder-Resistant Transport

Function

Prevent packet drops

Evenly balance load

Support lower layers

24

Page 25: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Outline

• Causes of long data retrieval times

• Cutting the tail with DeTail

• Evaluation

25

Page 26: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Simulation and Implementation

• NS-3 simulation

• Click implementation

– Drivers and NICs buffer hundreds of packets

– Must rate-limit Click to underflow buffers

26

Page 27: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Topology

• FatTree: 128-server (NS-3) / 16-server (Click)

• Oversubscription factor of 4x

Cores

Aggs

TORs

Reproduced From: A Scalable Commodity Datacenter Network Architecture *SIGCOMM’08+27

Page 28: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Setup

• Baseline

– TCP NewReno

– Flow hashing based on IP headers

– Prioritization of data retrievals vs. background

• Metric

– Reduction in 99.9th percentile completion time

28

Page 29: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Page Creation Workload

• Retrieval size: 2, 4, 8, 16, 32 KB*

• Background traffic: 1MB flows

*Covers range of query traffic sizes reported by DCTCP

DeTail reduces 99.9th percentile page creationtime by over 50%

29

Page 30: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Is the Whole Stack Necessary?

• Evaluated push-back w/o adaptive load balancing

– Performs worse than baseline

30

DeTail’s mechanisms work together, overcomingtheir individual limitations

Page 31: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

What About Link Failures?

• 10s of link failures occur per day*

– Creates permanent network imbalance

• Example

– Core-AGG link degrades from 1Gbps to 100Mbps

– DeTail achieves 91% reduction in the 99.9th percentile

DeTail effectively moves traffic away from failures,appropriately balancing load

*Understanding Network Failures in Data Centers *SIGCOMM’11+31

Page 32: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

What About Long Background Flows?

• Background Traffic: 1, 16, 64MB flows*

• Light data retrieval traffic

DeTail’s adaptive load balancing also helps long flows

*Covers range of update flow sizes reported by DCTCP32

Page 33: DeTail - SIGCOMM...Why the Tail Matters •Recall: 100’sof data retrievals per page creation ... •Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation

Conclusion

• Long tail harms page creation

– The extreme case becomes the common case

– Limits number of data retrievals per page

• The DeTail stack improves long tail performance

– Can reduce the 99.9th percentile by more than 50%

33