Profiling Network Performance in Multi-tier Datacenter Applications
Minlan Yu, Princeton University
Joint work with Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, Changhoon Kim
Scalable Net-App Profiler
Applications inside Data Centers
[Figure: a multi-tier application with a front-end server, aggregators, and workers]
Challenges of Datacenter Diagnosis
• Large complex applications
  – Hundreds of application components
  – Tens of thousands of servers
• New performance problems
  – Update code to add features or fix bugs
  – Change components while app is still in operation
• Old performance problems (human factors)
  – Developers may not understand the network well
  – Nagle’s algorithm, delayed ACK, etc.
Diagnosis in Today’s Data Center
[Figure: a host running the app and OS, with a packet sniffer attached]
• App logs: #reqs/sec, response time, 1% of requests with > 200 ms delay
  – Application-specific
• Switch logs: #bytes/#pkts per minute
  – Too coarse-grained
• Packet trace: filter out trace for long-delay requests
  – Too expensive
• SNAP: diagnose net-app interactions
  – Generic, fine-grained, and lightweight
SNAP: A Scalable Net-App Profiler
that runs everywhere, all the time
SNAP Architecture
At each host for every connection
Collect data
Collect Data in TCP Stack
• TCP understands net-app interactions
  – Flow control: how much data apps want to read/write
  – Congestion control: network delay and congestion
• Collect TCP-level statistics
  – Defined by RFC 4898
  – Already exist in today’s Linux and Windows OSes
TCP-level Statistics
• Cumulative counters
  – Packet loss: #FastRetrans, #Timeout
  – RTT estimation: #SampleRTT, #SumRTT
  – Receiver: RwinLimitTime
  – Calculate the difference between two polls
• Instantaneous snapshots
  – #Bytes in the send buffer
  – Congestion window size, receiver window size
  – Representative snapshots based on Poisson sampling
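The two kinds of statistics can be sketched in Python. This is a minimal sketch, not SNAP's implementation: the counter names mirror the slide, while `counter_deltas`, `poisson_poll_times`, the sample values, and the assumption that SumRTT is in milliseconds are all hypothetical.

```python
import random

def counter_deltas(prev, curr):
    """Difference of cumulative counters between two polls."""
    return {name: curr[name] - prev[name] for name in curr}

def poisson_poll_times(rate_per_sec, duration_sec, seed=0):
    """Poll times with exponentially distributed gaps (a Poisson process)."""
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_sec)
    while t < duration_sec:
        times.append(t)
        t += rng.expovariate(rate_per_sec)
    return times

# Hypothetical counter values from two consecutive polls of one connection.
prev_poll = {"FastRetrans": 3, "Timeout": 1, "SampleRTT": 100, "SumRTT": 5000}
curr_poll = {"FastRetrans": 5, "Timeout": 1, "SampleRTT": 140, "SumRTT": 7400}

delta = counter_deltas(prev_poll, curr_poll)
mean_rtt = delta["SumRTT"] / delta["SampleRTT"]  # mean RTT over the interval
poll_times = poisson_poll_times(rate_per_sec=2.0, duration_sec=10.0)
```

Differencing makes each poll describe only the most recent interval, and randomized poll times avoid locking onto periodic application behavior when taking snapshots.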
SNAP Architecture
At each host for every connection
Collect data
Performance Classifier
Life of Data Transfer
• Application generates the data
• Copy data to send buffer
• TCP sends data to the network
• Receiver receives the data and sends an ACK
[Figure: the data path through four stages: sender app, send buffer, network, receiver]
Taxonomy of Network Performance
• Sender app
  – No network problem
• Send buffer
  – Send buffer not large enough
• Network
  – Fast retransmission
  – Timeout
• Receiver
  – Not reading fast enough (CPU, disk, etc.)
  – Not ACKing fast enough (delayed ACK)
Identifying Performance Problems
• Sender app
  – Not any other problems
• Send buffer
  – #bytes in send buffer
• Network
  – #Fast retransmission
  – #Timeout
• Receiver
  – RwinLimitTime
  – Delayed ACK: diff(SumRTT) > diff(SampleRTT) * MaxQueuingDelay
Each signal is obtained by direct measure, sampling, or inference.
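The per-stage checks above can be sketched as a small classifier. This is an illustrative sketch under stated assumptions, not SNAP's code: `PollDelta`, `classify`, the send-buffer test (sampled bytes reaching an assumed buffer size), and the `MAX_QUEUING_DELAY_MS` value are hypothetical. Only the delayed-ACK inequality is taken directly from the slide.

```python
from dataclasses import dataclass

MAX_QUEUING_DELAY_MS = 10.0  # hypothetical threshold, not from the slides

@dataclass
class PollDelta:
    """Per-connection changes between two polls (names follow the slides)."""
    fast_retrans: int       # diff(#FastRetrans)
    timeouts: int           # diff(#Timeout)
    rwin_limit_ms: float    # diff(RwinLimitTime)
    sample_rtt: int         # diff(#SampleRTT)
    sum_rtt_ms: float       # diff(#SumRTT)
    send_buffer_bytes: int  # sampled snapshot
    send_buffer_size: int   # assumed to be known to the profiler

def classify(d):
    """Return the set of stages that appear to limit this connection."""
    limits = set()
    if d.send_buffer_bytes >= d.send_buffer_size:
        limits.add("send buffer")
    if d.fast_retrans > 0 or d.timeouts > 0:
        limits.add("network")
    if d.rwin_limit_ms > 0:
        limits.add("receiver: not reading fast enough")
    # Slide's inference: diff(SumRTT) > diff(SampleRTT) * MaxQueuingDelay
    if d.sum_rtt_ms > d.sample_rtt * MAX_QUEUING_DELAY_MS:
        limits.add("receiver: delayed ACK")
    # "Not any other problems": the sender app itself is the limit.
    return limits or {"sender app"}
```

The inequality compares the mean RTT over the interval against the largest delay queuing could plausibly explain, so anything above that is attributed to the receiver holding back ACKs.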
SNAP Architecture
At each host for every connection
Collect data
Performance Classifier
Cross-connection correlation
Management System: topology, routing, conn → proc/app
Offending app, host, link, or switch
Pinpoint Problems via Correlation
• Correlation over shared switch/link/host
  – Packet loss for all the connections going through one switch/host
  – Pinpoint the problematic switch
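One way to picture this correlation is to count, for each element on a connection's path, how many of the connections crossing it saw packet loss. The sketch below is hypothetical: the record layout, the element names, and the `threshold` and `min_conns` cutoffs are assumptions chosen for illustration.

```python
from collections import defaultdict

def loss_fraction_by_element(connections):
    """Per switch/link/host: (connections crossing it, fraction with loss)."""
    total = defaultdict(int)
    lossy = defaultdict(int)
    for conn in connections:
        for element in conn["path"]:
            total[element] += 1
            if conn["loss"]:
                lossy[element] += 1
    return {e: (total[e], lossy[e] / total[e]) for e in total}

def suspects(connections, threshold=0.8, min_conns=3):
    """Elements crossed by enough connections, most of which saw loss."""
    stats = loss_fraction_by_element(connections)
    return sorted(e for e, (n, frac) in stats.items()
                  if n >= min_conns and frac >= threshold)

# Hypothetical paths supplied by the management system (topology, routing).
conns = [
    {"path": ["hostA", "tor1", "agg1", "tor2", "hostB"], "loss": True},
    {"path": ["hostC", "tor1", "agg1", "tor3", "hostD"], "loss": True},
    {"path": ["hostE", "tor4", "agg2", "tor2", "hostF"], "loss": False},
    {"path": ["hostG", "tor4", "agg1", "tor3", "hostH"], "loss": True},
]
```

Here `suspects(conns)` singles out `agg1`, the only element shared by at least three connections that all saw loss. The minimum-connection cutoff keeps a host with a single lossy connection from being flagged.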
Pinpoint Problems via Correlation
• Correlation over application
  – Same application has problem on all machines
  – Report aggregated application behavior
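Aggregating by application could look like the following sketch, which reports, for each application and problem class, the fraction of that application's hosts where the problem shows up. The tuple layout and the application and host names are hypothetical.

```python
from collections import defaultdict

def aggregate_by_app(reports):
    """reports: (app, host, problem or None) tuples from per-connection results."""
    hosts = defaultdict(set)     # app -> hosts running it
    affected = defaultdict(set)  # (app, problem) -> hosts showing the problem
    for app, host, problem in reports:
        hosts[app].add(host)
        if problem is not None:
            affected[(app, problem)].add(host)
    return {key: len(h) / len(hosts[key[0]]) for key, h in affected.items()}

reports = [
    ("search", "h1", "delayed ACK"),
    ("search", "h2", "delayed ACK"),
    ("search", "h3", "delayed ACK"),
    ("logging", "h1", None),
    ("logging", "h2", "timeout"),
]
summary = aggregate_by_app(reports)
```

A fraction near 1.0 suggests the problem follows the application rather than any particular machine.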
SNAP Architecture
At each host for every connection
Collect data
Performance Classifier
Cross-connection correlation
Management System: topology, routing, conn → proc/app
Offending app, host, link, or switch
Online, lightweight processing & diagnosis
Offline, cross-conn diagnosis
Reducing SNAP Overhead
• SNAP overhead
  – Data volume: 120 bytes per connection per poll
  – CPU overhead:
    • 5% for polling 1K connections with a 500 ms interval
    • Increases with #connections and polling frequency
• Solution: adaptive tuning of polling frequency
  – Reduce polling frequency to stay within a target CPU
  – Devote more polling to more problematic connections
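A rough sketch of adaptive tuning follows, assuming CPU cost grows linearly with polls per second. The slide only says it increases, so the linear model is an assumption, calibrated from the 5% figure above. The weighting rule (`1 + score`) and the function name are hypothetical.

```python
# Assumed linear cost model: 5% CPU for 1K connections polled every 500 ms.
CPU_PCT_PER_POLL_PER_SEC = 5.0 / (1000 / 0.5)

def polling_intervals(problem_scores, target_cpu_pct):
    """Split a CPU-bounded polling budget, favoring problematic connections."""
    budget = target_cpu_pct / CPU_PCT_PER_POLL_PER_SEC  # polls per second
    # Every connection keeps a base weight of 1 so that none is starved.
    weights = {conn: 1.0 + score for conn, score in problem_scores.items()}
    total = sum(weights.values())
    # A connection's poll rate is budget * weight / total; invert for interval.
    return {conn: total / (budget * w) for conn, w in weights.items()}

# Connection "c" looks problematic, so it is polled three times as often.
intervals = polling_intervals({"a": 0, "b": 0, "c": 2}, target_cpu_pct=0.01)
```

With uniform scores and a 5% target, 1K connections come out at the slide's 500 ms interval. Lowering the target stretches every interval proportionally.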
SNAP in the Real World
Key Diagnosis Steps
• Identify performance problems
  – Correlate across connections
  – Identify applications with severe problems
• Expose simple, useful information to developers
  – Filter important statistics and classification results
• Identify root cause and propose solutions
  – Work with operators and developers
  – Tune TCP stack or change application code
SNAP Deployment
• Deployed in a production data center
  – 8K machines, 700 applications
  – Ran SNAP for a week, collected terabytes of data
• Diagnosis results
  – Identified 15 major performance problems
  – 21% of applications have network performance problems
Characterizing Perf. Limitations
#Apps that are limited for > 50% of the time:
• Send buffer (1 app)
  – Send buffer not large enough
• Network (6 apps)
  – Fast retransmission
  – Timeout
• Receiver
  – Not reading fast enough (CPU, disk, etc.) (8 apps)
  – Not ACKing fast enough (delayed ACK) (144 apps)
Three Example Problems
• Delayed ACK affects delay-sensitive apps
• Congestion window allows sudden burst
• Significant timeouts for low-rate flows
Problem 1: Delayed ACK
• Delayed ACK affected many delay-sensitive apps
  – Even #pkts per record: 1,000 records/sec
  – Odd #pkts per record: 5 records/sec
  – Delayed ACK was used to reduce bandwidth usage and server interrupts
[Figure: A sends data to B; B ACKs every other packet, and the ACK for an unpaired packet is delayed by 200 ms]
• Proposed solution: delayed ACK should be disabled in data centers
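The gap between 1,000 and 5 records/sec follows from the 200 ms timer. The sketch below assumes the sender does not start the next record until the previous one is fully acknowledged, and that a record otherwise takes 1 ms (implied by 1,000 records/sec). Both are simplifying assumptions, and `records_per_sec` is a hypothetical helper.

```python
DELAYED_ACK_TIMER_S = 0.200  # delay shown in the slide's figure
BASE_RECORD_TIME_S = 0.001   # implied by 1,000 records/sec on the slide

def records_per_sec(pkts_per_record):
    """Throughput when each record must be fully ACKed before the next starts."""
    # The receiver ACKs every other packet, so with an odd count the last
    # packet has no partner and its ACK waits for the delayed-ACK timer.
    stalled = pkts_per_record % 2 == 1
    per_record = BASE_RECORD_TIME_S + (DELAYED_ACK_TIMER_S if stalled else 0.0)
    return 1.0 / per_record
```

Under these assumptions an even packet count gives about 1,000 records/sec, while an odd count gives 1 / 0.201, or roughly 5 records/sec, matching the slide.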
Send Buffer and Delayed ACK
[Figure, two panels]
• With socket send buffer: application buffer → socket send buffer → network stack → receiver
  – 1. Send complete
  – 2. ACK
• Zero-copy send: application buffer → network stack → receiver
  – 1. ACK
  – 2. Send complete
• SNAP diagnosis: delayed ACK and zero-copy send
Problem 2: Congestion Window Allows Sudden Bursts
• Increase congestion window to reduce delay
  – To send 64 KB of data in 1 RTT
  – Developers intentionally keep congestion window large
  – Disable slow start restart in TCP
[Figure: congestion window over time t; the window drops after an idle time]
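To see why developers want a large window, the sketch below estimates how many round trips a 64 KB burst needs. It assumes a 1460-byte MSS and idealized slow start (the window doubles every RTT, with no loss). Neither the MSS nor the restart window of 4 segments used in the example comes from the slides.

```python
import math

MSS_BYTES = 1460  # assumed Ethernet-sized segment, not from the slides

def segments_needed(nbytes, mss=MSS_BYTES):
    """Number of full-size segments needed to carry nbytes."""
    return math.ceil(nbytes / mss)

def rtts_to_send(total_segments, cwnd):
    """RTTs to send a burst under idealized slow start (window doubles per RTT)."""
    rtts, sent = 0, 0
    while sent < total_segments:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts

burst = segments_needed(64 * 1024)             # 45 segments
with_large_window = rtts_to_send(burst, 45)    # 1 RTT
after_restart = rtts_to_send(burst, 4)         # 4 RTTs
```

Keeping the window at 45 segments sends the burst in one RTT, whereas a window that has dropped to 4 segments after an idle period needs four.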
Slow Start Restart
• SNAP diagnosis
  – Significant packet loss
  – Congestion window is too large after an idle period
• Proposed solutions
  – Change apps to send less data during congestion
  – New transport protocols that consider both congestion and delay
Problem 3: Timeouts for Low-rate Flows
• SNAP diagnosis
  – More fast retransmissions for high-rate flows (1-10 MB/s)
  – More timeouts with low-rate flows (10-100 KB/s)
• Proposed solutions
  – Reduce timeout time in TCP stack
  – New ways to handle packet loss for small flows
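A likely explanation for this split, offered here as an assumption rather than something the slides state, is that standard TCP fast retransmit needs three duplicate ACKs, and a low-rate flow rarely has enough packets in flight behind a lost one to generate them. The helper below is a hypothetical illustration of that rule.

```python
DUP_ACK_THRESHOLD = 3  # standard TCP fast-retransmit threshold

def recovery(packets_in_flight, lost_index):
    """How TCP recovers when one packet of a burst is lost (0-based index)."""
    # Each packet delivered after the lost one triggers one duplicate ACK.
    later_packets = packets_in_flight - lost_index - 1
    if later_packets >= DUP_ACK_THRESHOLD:
        return "fast retransmit"
    return "timeout"
```

For example, `recovery(20, 5)` returns `"fast retransmit"`, while a three-packet burst that loses its first packet, `recovery(3, 0)`, returns `"timeout"`.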
Conclusion
• A simple, efficient way to profile data centers
  – Passively measure real-time network stack information
  – Systematically identify problematic stages
  – Correlate problems across connections
• Deploying SNAP in production data center
  – Diagnose net-app interactions
  – A quick way to identify them when problems happen
• Future work
  – Extend SNAP to diagnose wide-area transfers
Thanks!