Post on 15-Jan-2016
Profiling Network Performance in Multi-tier Datacenter Applications
Jori Hardman, Carly Ho
Paper by Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, Changhoon Kim
Applications inside Data Centers
[Diagram: a front-end server fans requests out to a tree of aggregators, which fan out to workers]
Challenges of Datacenter Diagnosis
• Multi-tier applications
  o Tens to hundreds of application components
  o Tens of thousands of servers
• Evolving applications
  o Add new features, fix bugs
  o Change components while the app is still in operation
• Human factors
  o Developers may not understand the network well
  o Nagle's algorithm, delayed ACK, etc.
Where are the Performance Problems?
• Network or application?
  o App team: why low throughput and high delay?
  o Net team: no equipment failure or congestion
• Network and application: their interactions!
  o Network stack is not configured correctly
  o Small application writes are delayed by TCP
  o TCP incast: synchronized writes cause packet loss
A diagnosis tool to understand network-application interactions
Diagnosis in Today's Data Center
• App logs (#reqs/sec, response time; e.g., 1% of requests see >200 ms delay): application-specific
• Packet sniffer on the host (filter the packet trace for long-delay requests): too expensive
• Switch logs (#bytes/pkts per minute): too coarse-grained
• SNAP (diagnoses net-app interactions): generic, fine-grained, and lightweight
Full Knowledge of Data Centers
• Direct access to the network stack
  o Directly measure rather than relying on inference
  o E.g., number of fast-retransmission packets
• Application-server mapping
  o Know which application runs on which servers
  o E.g., which app to blame for sending a lot of traffic
• Network topology and routing
  o Know which application uses which resource
  o E.g., which app is affected if a link is congested
SNAP: Scalable Net-App Profiler
Outline
• SNAP architecture
  o Passively measure real-time network stack info
  o Systematically identify performance problems
  o Correlate across connections to pinpoint problems
• SNAP deployment
  o Operators: characterize performance problems
  o Developers: identify problems for applications
• SNAP validation and overhead
SNAP Architecture
Step 1: Network-stack measurements
What Data to Collect?
• Goals:
  o Fine-grained: in milliseconds or seconds
  o Low overhead: low CPU overhead and data volume
  o Generic across applications
• Two types of data:
  o Poll TCP statistics → network performance
  o Event-driven socket logging → app expectation
  o Both exist in today's Linux and Windows systems
TCP statistics
• Instantaneous snapshots
  o #Bytes in the send buffer
  o Congestion window size, receiver window size
  o Snapshots taken with Poisson sampling
• Cumulative counters
  o #FastRetrans, #Timeout
  o RTT estimation: #SampleRTT, #SumRTT
  o RwinLimitTime
  o Compute the difference between two polls
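The polling loop above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `read_counters` stands in for whatever reads the OS's cumulative TCP counters, and the exponentially distributed gaps are what make the sampling a Poisson process.

```python
import random
import time

def poisson_poll(read_counters, mean_interval_s=0.5, num_polls=5):
    """Poll cumulative TCP counters at Poisson-spaced intervals and
    return the per-interval deltas that the classifier consumes."""
    prev = read_counters()
    deltas = []
    for _ in range(num_polls):
        # Exponential inter-arrival times give a Poisson sampling
        # process, avoiding lock-step with periodic app behavior.
        time.sleep(random.expovariate(1.0 / mean_interval_s))
        cur = read_counters()
        deltas.append({k: cur[k] - prev[k] for k in cur})
        prev = cur
    return deltas
```

Differencing two polls turns counters like #FastRetrans into per-interval rates, which is what the classification step needs.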
SNAP Architecture
Step 2: Performance problem classification
Life of Data Transfer
• Application generates the data
• Copy data to send buffer
• TCP sends data to the network
• Receiver receives the data and sends an ACK
[Diagram: Sender App → Send Buffer → Network → Receiver]
Classifying Socket Performance
• Sender app: bottlenecked by CPU, disk, etc.; slow due to app design (small writes)
• Send buffer: not large enough
• Network: fast retransmission, timeout
• Receiver: not reading fast enough (CPU, disk, etc.); not ACKing fast enough (delayed ACK)
Identifying Performance Problems
• Sender app: no other problems detected
• Send buffer: send buffer is almost full (sampled snapshots)
• Network: #FastRetrans, #Timeout (direct measure)
• Receiver: RwinLimitTime (direct measure); delayed ACK inferred when diff(SumRTT) > diff(SampleRTT) × MaxQueuingDelay
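The per-interval classification can be sketched as follows. The field names and the MaxQueuingDelay bound are hypothetical, and the rules are simplified from the slide; this is an illustration of the idea, not SNAP's implementation.

```python
MAX_QUEUING_DELAY_MS = 5  # assumed bound on in-network queuing delay

def classify(stats):
    """Map one polling interval of a connection's TCP stats to
    problem classes, roughly following the slide's decision rules."""
    problems = []
    if stats["send_buf_bytes"] >= stats["send_buf_limit"]:
        problems.append("send buffer limited")
    if stats["fast_retrans"] > 0 or stats["timeouts"] > 0:
        problems.append("network limited")
    if stats["rwin_limit_time_ms"] > 0:
        problems.append("receiver window limited")
    # Delayed-ACK inference from the slide's rule:
    # diff(SumRTT) > diff(SampleRTT) * MaxQueuingDelay
    if stats["sum_rtt_ms"] > stats["sample_rtt_count"] * MAX_QUEUING_DELAY_MS:
        problems.append("delayed ACK suspected")
    # If nothing else limits the connection, blame the sender app.
    return problems or ["sender app limited"]
```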
SNAP Architecture
Step 3: Correlation across connections
Pinpoint Problems via Correlation
• Correlation over shared switch/link/host
  o Packet loss for all the connections going through one switch/host
  o Pinpoint the problematic switch
Pinpoint Problems via Correlation
• Correlation over application
  o The same application has problems on all machines
  o Report aggregated application behavior
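The correlation step can be sketched with a hypothetical data model: each record flags whether a connection had a problem and names the components (host, switch, app) it touches, and the score is simply the fraction of problematic connections per component.

```python
from collections import defaultdict

def correlate(records, key):
    """Score each shared component (host, switch, or app) by the
    fraction of its connections that showed a problem."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        bad[r[key]] += r["has_problem"]
    return {k: bad[k] / totals[k] for k in totals}
```

A component whose score sits near 1.0 while its peers stay near the background problem rate is the natural place to look first.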
SNAP Architecture
SNAP Deployment
• Production data center
  o 8K machines, 700 applications
  o Ran SNAP for a week, collected petabytes of data
• Operators: profiling the whole data center
  o Characterize the sources of performance problems
  o Key problems in the data center
• Developers: profiling individual applications
  o Pinpoint problems in app software, the network stack, and their interactions
Performance Problem Overview
• A small number of apps suffer from significant performance problems
Problems         >5% of the time   >50% of the time
Sender app       567 apps          551
Send buffer      1                 1
Network          30                6
Recv win limit   22                8
Delayed ACK      154               144
SNAP diagnosis
• SNAP diagnosis steps:
  o Correlate connection performance to pinpoint applications with problems
  o Expose socket and TCP stats
  o Find the root cause with operators and developers
  o Propose potential solutions
Classifying Socket Performance
• Sender app: bottlenecked by CPU, disk, etc.; slow due to app design (small writes)
• Send buffer: not large enough
• Network: fast retransmission, timeout
• Receiver: not reading fast enough (CPU, disk, etc.); not ACKing fast enough (delayed ACK)
Send Buffer and Recv Window
• Problems on a single connection
[Diagram: the app process write()s bytes into the TCP send buffer; on the other side, the app process read()s bytes from the TCP recv buffer]
  o Some apps use the default 8 KB send buffer
  o The fixed 64 KB maximum receive window is not enough for some apps
Need Buffer Autotuning
• Problems of sharing buffers at a single host
  o More send-buffer problems on machines with more connections
  o How to set buffer sizes cooperatively?
• Auto-tuning send buffer and recv window
  o Dynamically allocate buffer across applications
  o Based on the congestion window of each app
  o Tune send buffer and recv window together
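One way to make "allocate based on each app's congestion window" concrete is a proportional split of a host-wide buffer budget, with a floor at the old 8 KB default. This policy is an assumption for illustration, not the paper's mechanism.

```python
def allocate_buffers(cwnd_bytes, host_budget_bytes, floor_bytes=8192):
    """Split a host-wide buffer budget across connections in
    proportion to each connection's congestion window."""
    total = sum(cwnd_bytes.values())
    if total == 0:
        return {conn: floor_bytes for conn in cwnd_bytes}
    return {conn: max(floor_bytes, host_budget_bytes * cwnd // total)
            for conn, cwnd in cwnd_bytes.items()}
```

A connection that is growing its congestion window automatically receives a larger share on the next allocation, which is the cooperative behavior the slide asks for.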
Classifying Socket Performance
• Sender app: bottlenecked by CPU, disk, etc.; slow due to app design (small writes)
• Send buffer: not large enough
• Network: fast retransmission, timeout
• Receiver: not reading fast enough (CPU, disk, etc.); not ACKing fast enough (delayed ACK)
Packet Loss in a Day in the Datacenter
• Packet loss bursts every hour
• 2-4 am is the backup time
Spread Writes over Multiple Connections
• SNAP diagnosis:
  o More timeouts than fast retransmissions
  o Small packet sending rate
• Root cause:
  o Two connections were used to avoid head-of-line blocking
  o Low-rate small requests get more timeouts
• Solution:
  o Use one connection; assign an ID to each request
  o Combine data to reduce timeouts
Congestion Window Allows Sudden Bursts
• SNAP diagnosis:
  o Significant packet loss
  o Congestion window is too large after an idle period
• Root cause:
  o Slow-start restart is disabled
Slow Start Restart
• Slow start restart
  o Reduces the congestion window if the connection is idle, to prevent sudden bursts
[Plot: congestion window drops after an idle period]
Slow Start Restart
• However, developers disabled it because:
  o They intentionally keep the congestion window large over a persistent connection to reduce delay
  o E.g., with a large congestion window it takes just 1 RTT to send 64 KB of data
• Potential solution:
  o New congestion control for delay-sensitive traffic
Classifying Socket Performance
• Sender app: bottlenecked by CPU, disk, etc.; slow due to app design (small writes)
• Send buffer: not large enough
• Network: fast retransmission, timeout
• Receiver: not reading fast enough (CPU, disk, etc.); not ACKing fast enough (delayed ACK)
Timeout and Delayed ACK
• SNAP diagnosis:
  o Congestion window drops to one after a timeout
  o The retransmission is then further stalled by a delayed ACK
• Solution:
  o Have the congestion window drop to two instead
Nagle and Delayed ACK
• SNAP diagnosis:
  o Delayed ACK combined with small writes
[Diagram: the app write()s W1 and W2, each smaller than the MSS; Nagle's algorithm holds the segment carrying W2 until the ACK for W1 arrives, and the receiver delays that ACK by 200 ms before the app read()s W1 and W2]
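One common mitigation for this Nagle/delayed-ACK interaction (a standard fix, not necessarily the only one the talk proposes) is to disable Nagle's algorithm on the socket, so a small second write goes out immediately instead of waiting for the delayed ACK:

```python
import socket

def disable_nagle(sock):
    """Turn off Nagle's algorithm so small writes are sent
    immediately rather than coalesced behind an unacked segment."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

The trade-off is more small packets on the wire, which is usually acceptable for latency-sensitive request/response traffic inside a data center.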
Send Buffer and Delayed ACK
• SNAP diagnosis: delayed ACK and send buffer = 0
[Diagram: with a send buffer, the application's send completes (1) before the ACK returns (2); with the send buffer set to zero, the network stack must receive the ACK (1) before the send completes (2), so each delayed ACK stalls the application]
SNAP Validation and Overhead
Correlation Accuracy
• Injected two real problems
• Mixed labeled data with real production data
• Correlated over shared machines
• Successfully identified the labeled machines
• 2.7% of machines have correlation accuracy > 0.4
SNAP Overhead
• Data volume
  o Socket logs: 20 bytes per socket
  o TCP statistics: 120 bytes per connection per poll
• CPU overhead
  o Logging socket calls: event-driven, < 5%
  o Reading the TCP table
  o Polling TCP statistics

Reducing CPU Overhead
• CPU overhead
  o Polling TCP statistics and reading the TCP table
  o Increases with the number of connections and the polling frequency
  o E.g., 35% for polling 5K connections at a 50 ms interval; 5% for polling 1K connections at a 500 ms interval
• Adaptive tuning of polling frequency
  o Reduce the polling frequency to stay within a target CPU budget
  o Devote more polling to the more problematic connections
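The adaptive-tuning idea can be sketched as a simple proportional controller: lengthen the polling interval when measured CPU use exceeds the target, shorten it when there is headroom, clamped to sane bounds. The proportional policy and its parameters are assumptions for illustration.

```python
def adapt_interval(interval_ms, cpu_pct, target_pct=5.0,
                   min_ms=50, max_ms=5000):
    """Scale the polling interval so measured CPU use converges
    toward the target, within [min_ms, max_ms]."""
    scaled = interval_ms * (cpu_pct / target_pct)
    return max(min_ms, min(max_ms, scaled))
```

Running this after every poll doubles the interval when CPU use is twice the target and halves it when use is half the target, so polling cost settles near the budget.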
Conclusion
• A simple, efficient way to profile data centers
  o Passively measure real-time network-stack information
  o Systematically identify components with problems
  o Correlate problems across connections
• Deploying SNAP in a production data center
  o Characterize data center performance problems, helping operators improve the platform and tune the network
  o Discover app-net interactions, helping developers pinpoint app problems