New Challenges in Network Diagnosis


Page 1: New Challenges in Network Diagnosis

New Challenges in Network Diagnosis

Minlan Yu, Harvard University

Joint work with Yuliang Li, Jiaqi Gao, Rui Miao, Mohammad Alizadeh, Nofel Yaseen, Robert MacDavid, Felipe Vieira Frujeri, Vincent Liu, Ricardo Bianchini, Ramaswamy Aditya, Xiaohang Wang, Henry Lee, David Maltz, Behnaz Arzani

Page 2: New Challenges in Network Diagnosis

Diagnosing Network Applications

• Many networked applications run in large production networks

• They require high availability and SLOs (service-level objectives)

• Operators need to quickly locate problems when things go wrong

Page 3: New Challenges in Network Diagnosis

New Challenges for Diagnosis

Ultra-low latency (the killer microseconds)

Growing complexity (involves many components and layers in cloud-scale services)

At scale (~100K servers/VMs/processes)

Page 4: New Challenges in Network Diagnosis

Challenge 1: Ultra Low Latency

• The killer microseconds

• GPU, TPU: O(10 µs)
• Flash: O(10 µs); NVM: O(1 µs)
• RDMA: O(1 µs)
• Ultra-low-latency applications: real-time reinforcement learning, large-scale distributed systems, packet-level traffic analysis

Page 5: New Challenges in Network Diagnosis

Many Fine-Time-Scale Events

• Performance is sensitive to many fine-time-scale events
  o More intermittent events
  o Many events happen at the same time
  o Small events have cascading impact across components and over time

[Figure: a cloud of fine-time-scale events — link failures, DDoS, traffic surges, packet delays, lost packets, packet bursts, thread bursts, incast, garbage collection, interrupts, congestion, cache misses, loops, context switching, burst loss]

How to collect these fine-grained events at scale? How to correlate them with performance problems?

Page 6: New Challenges in Network Diagnosis

Challenge 1: Ultra Low Latency at Scale

• Overall latency ≥ latency of the slowest component (tail latency!)
  – Small blips on individual flows cause delays
  – Touching more machines increases the likelihood of delays

[Figure: Bing query workflow (SIGCOMM'13)]
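To make the second point concrete, here is a minimal sketch (with assumed numbers, not from the talk) of how the chance of hitting at least one slow component grows with fan-out:

    # Minimal sketch (assumed blip probability): chance that a query touching
    # n servers hits at least one slow one, when each server is independently
    # slow with probability p.
    def p_tail(n: int, p: float = 0.01) -> float:
        return 1.0 - (1.0 - p) ** n

    for n in (1, 10, 100, 1000):
        print(f"fan-out {n:5d}: P(at least one slow server) = {p_tail(n):.3f}")
    # fan-out 1: 0.010, 10: 0.096, 100: 0.634, 1000: ~1.000

Even a 1% per-server blip probability makes delays near certain once a request fans out to hundreds of machines.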

Page 7: New Challenges in Network Diagnosis

Challenge 2: Growing Complexity

• Developers have to master growing complexity

[Figure: the layers a developer must reason about — services (networking, storage, databases, load balancing, security, ISPs), hardware (optical links, switches, servers, NICs, accelerators), symptoms (packet losses, long delays, bursts, throughput drops, transient stalls, connectivity issues), and configuration (topology, routing, device configs, OS stack, transport, …)]

Find needles in the haystack

Page 8: New Challenges in Network Diagnosis

Challenge 2: Growing Complexity

• Developers have to master growing complexity
• The blame game
  – "There must be something wrong in a component that I don't control or understand"
  – "Things worked fine in my component before, and nothing changed"
• The network is often the target of blame
  – It is interconnected with many components
  – It is less visible to upper-layer applications

Automatically identify the right team/component


Page 10: New Challenges in Network Diagnosis

New Challenges for Diagnosis

Ultra-low latency (the killer microseconds)

Growing complexity (involves many teams and layers in cloud-scale services)

At scale (~100K servers/VMs/processes)

Fine-grained data collection at scale

Automatic analysis to identify the responsible components

Page 11: New Challenges in Network Diagnosis

This talk

• DETER: Record-and-Replay for TCP
  – Collect detailed yet lightweight packet information at scale

• Scouts: Domain-customized incident routing
  – Automatically direct incident tickets to the right team

Page 12: New Challenges in Network Diagnosis

Detailed packet-level information for TCP diagnosis


Page 13: New Challenges in Network Diagnosis

TCP performance diagnosis is important

• Highly distributed applications run in large production networks
  – These apps rely on high throughput and low latency across all of their TCP connections
• Yet TCP problems happen all the time
  – Tail latency matters
  – A single flow with long latency can slow down the entire job

Page 14: New Challenges in Network Diagnosis

Why is diagnosing TCP hard?

• TCP in the textbook:

[Figure: textbook TCP — the sender cycles through slow start, congestion avoidance, and fast recovery; the sender sends data and the receiver sends ACKs]

Page 15: New Challenges in Network Diagnosis

TCP is complex!

• Reality…

[Figure: the real Linux TCP stack. Sender: congestion control, loss recovery, send buffer manager, socket call manager, ACK processor, packet generator, timer manager, send window manager, pacing rate manager, Nagle test, TSO. Receiver: congestion control, packet processor, ACK generator, receive buffer manager, attack mitigation, socket call manager, delayed ACK manager, receive window manager]

Page 16: New Challenges in Network Diagnosis

TCP is complex!

• Unexpected interactions between different components

[Figure: the same sender/receiver component diagram, highlighting an unexpected interaction between the receiver's receive window manager and the sender's loss recovery that results in no fast recovery]

Page 17: New Challenges in Network Diagnosis

TCP is complex!

• Unexpected interactions between different components

[Figure: the same diagram, highlighting another unexpected interaction — attack mitigation on the receiver ignores some packets, so the sender's send window manager gets no response]

Page 18: New Challenges in Network Diagnosis

TCP is complex!

[Figure: the sender/receiver component diagram again]

• Unexpected interactions between different components
• 63 parameters in Linux 4.4 TCP tune the behaviors of different components
• Continuous, error-prone development
  – 18 bugs found in Linux TCP in July and August 2018

Page 19: New Challenges in Network Diagnosis

How do we diagnose TCP today?


Tcpdump

Page 20: New Challenges in Network Diagnosis

Detailed diagnosis is not scalable

[Figure: timeline from 1990 to 2019 — we still rely on tcpdump]

• Bandwidth grew from 10 Mbps to 100 Gbps
• The number of hosts grew from tens to hundreds of thousands
• Too much overhead!

Page 21: New Challenges in Network Diagnosis

Tension between more details and low overhead

• Existing tools cannot achieve both
• DETERministic Record and Replay

[Figure: details for diagnosis vs. overhead — tcpdump and TCP probe give lots of details but high overhead (runtime record = data for diagnosis); TCP counters and eBPF have low overhead but miss lots of details; DETER gives all details at low overhead (runtime record < data for diagnosis)]

Page 22: New Challenges in Network Diagnosis

DETER overview

[Figure: DETER workflow. At runtime, a DETER recorder on each of the N hosts performs lightweight recording continuously. When a connection such as 10.0.0.1:80 -> 20.0.0.1:1234 shows long latency, the DETER replayer deterministically replays that connection on both endpoints, capturing packets and counters (e.g., via tcpdump or TCP probe), tracing executions, and supporting iterative diagnosis]

Page 23: New Challenges in Network Diagnosis

Lightweight record + deterministic replay

Page 24: New Challenges in Network Diagnosis

Intuition for being lightweight

[Figure: sender and receiver TCP stacks driven by socket calls]

• Idea: record socket calls at runtime and let TCP automatically regenerate the packets during replay
• This makes recording lightweight, but naive deterministic replay FAILS

Page 25: New Challenges in Network Diagnosis

Non-deterministic interactions w/ many parties

[Figure: sender and receiver TCP stacks driven by socket calls]

Page 26: New Challenges in Network Diagnosis

Non-deterministic interactions w/ many parties

[Figure: many hosts, each with kernel, TCP stack, and socket calls, interacting through the network]

Key contribution:
• Identifying the minimum set of data that enables deterministic replay

Two challenges:
• Network-wide: non-deterministic interactions across switches and TCP (the butterfly effect)
• On host: non-determinism within the kernel

Page 27: New Challenges in Network Diagnosis

Challenge 1: butterfly effect

• The closed loop between TCP and switches amplifies small noises

[Figure: TCP senders and receivers on many hosts form closed loops with the switches between them]

Page 28: New Challenges in Network Diagnosis

Challenge 1: butterfly effect

[Figure: a packet sent 1 µs later at replay time can be enqueued instead of dropped (or vice versa) at a switch, so the congestion window evolves differently (cong_win++ vs. cong_win/=2). µs-level noise comes from clock drift, context switching, kernel scheduling, and cache state. Sending time variation → switch action variation → TCP behavior variation]

Page 29: New Challenges in Network Diagnosis

Challenge 1: butterfly effect

[Figure: the same chain — sending time variation → switch action variation → TCP behavior variation — feeds back on itself: the butterfly effect]

Page 30: New Challenges in Network Diagnosis

Reduce Butterfly Effect

[Figure: the chain (sending time variation → switch action variation → TCP behavior variation) with record & replay points inserted at each TCP stack to cut the butterfly effect]

Page 31: New Challenges in Network Diagnosis

Challenge 1: butterfly effect

• Record all the packets entering TCP? High overhead.

[Figure: recording and replaying every packet at each TCP stack]

Page 32: New Challenges in Network Diagnosis

Challenge 1: butterfly effect

• Solution: record&replay packet stream mutations

[Figure: at runtime, record the mutations the network applies to each connection's packet stream (drops, ECN marks, reordering, etc.); at replay, re-apply the recorded mutations to the regenerated packet stream]

Page 33: New Challenges in Network Diagnosis

Challenge 1: butterfly effect

• Solution: record&replay packet stream mutations

[Figure: the same record-and-replay-of-mutations workflow as above]

+ Low overhead: drop rate < 10^-4; ECN: 1 bit/packet; reordering is rare
+ Replaying each TCP connection is independent: connections interact only via drops and ECN, which we replay
+ No switches are needed for replay — resource-efficient replay needs just two hosts
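To illustrate why recording mutations is cheap, here is a minimal sketch (hypothetical data structures, not DETER's actual kernel code) of a per-connection mutation log: the common case (delivered in order, no ECN mark) is implicit, so only rare events take any space.

    from dataclasses import dataclass

    # Minimal sketch (hypothetical, not DETER's kernel implementation):
    # per-connection log of packet-stream mutations.
    @dataclass
    class Packet:
        seq: int
        ecn: bool = False

    class MutationLog:
        def __init__(self):
            self.drops = set()       # seqs of dropped packets (rate < 1e-4)
            self.ecn_marks = set()   # seqs of ECN-marked packets (1 bit/packet)
            self.reorders = []       # (seq, new_position); reordering is rare

        # --- runtime: record what the network did to this connection ---
        def record(self, seq, dropped=False, ecn=False, new_position=None):
            if dropped:
                self.drops.add(seq)
            if ecn:
                self.ecn_marks.add(seq)
            if new_position is not None:
                self.reorders.append((seq, new_position))

        # --- replay: re-apply the mutations to the regenerated packet stream
        # (reordering is omitted here for brevity) ---
        def apply(self, pkt: Packet):
            if pkt.seq in self.drops:
                return None          # drop it again, exactly as at runtime
            if pkt.seq in self.ecn_marks:
                pkt.ecn = True       # mark it again
            return pkt

Because replay re-applies exactly the drops and marks seen at runtime, the two endpoints can be replayed by themselves, with no switches in the loop.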

Page 34: New Challenges in Network Diagnosis

Implementation

• DETER is implemented in Linux 4.4; it needs just 139 lines of changes to Linux TCP

• Lightweight recording
  – Storage: 2.1%–3.1% of compressed packet header traces
  – CPU: < 1.5%

Page 35: New Challenges in Network Diagnosis

Case study in Spark

• TeraSort on 200 GB across 20 servers (4 cores each) on EC2; 6.2K connections
• Replay and collect traces for the flows at the 99.9th-percentile latency

Page 36: New Challenges in Network Diagnosis

Case study in Spark

• Iteratively debug individual flows, e.g., a delayed-ACK problem
• The packet traces show a burst–40 ms–ACK pattern
• Tracing the TCP execution shows the receiver explicitly delays the ACK because its receive buffer is shrinking, which is caused by a slow receiver

Page 37: New Challenges in Network Diagnosis

Other use cases

• RTOs caused by different reasons
  – Delayed ACKs, exponential backoff, small messages, misconfigured receive buffer size
• We can also diagnose problems in the switches
  – Because we have the traces, we can push the packets back into the network
  – In simulation (requires modeling the switch data plane accurately)
  – Case study: a temporary blackhole caused by switch buffer sharing

Page 38: New Challenges in Network Diagnosis

Conclusion

• Performance diagnosis in large networks is challenging; many problems originate in the TCP stack

• DETER enables lightweight recording and deterministic TCP replay
  – Key challenge: the butterfly effect between TCP and switches
  – Record & replay packet stream mutations to break the closed loop

• DETER is open source: https://github.com/harvard-cns/deter

Page 39: New Challenges in Network Diagnosis

Scouts


Automatic diagnosis using domain-customized incident routing

Page 40: New Challenges in Network Diagnosis

Number of public incidents between February and July 2020: 2169

Maximum resolution time: 14 h 12 m / 19 h 49 m
Average resolution time: 4 h 40 m / 5 h 28 m

Incidents can and do happen

Page 41: New Challenges in Network Diagnosis

Life cycle of an incident

[Figure: life cycle of an incident. Report incident (monitoring system, customer reports) → find the failing component (find the right team, check monitoring systems, can we fix it?) → diagnose and fix the problem (find the problematic device, understand the cause, fix)]

Page 42: New Challenges in Network Diagnosis

[Figure: CDF of normalized resolution time — incidents investigated by multiple teams take roughly 10x longer than incidents investigated by a single team]

Finding the right team is time consuming

The bottleneck is the "find the failing component" stage: find the right team, check monitoring systems, decide whether we can fix it

Page 43: New Challenges in Network Diagnosis

Example incident: storage problem

1. "Can't write to storage!"
2. Must be a storage issue
3. Storage is good; the network must be slow
4. No congested links
5. Need more information from the customer
6. Connections fail to initialize; the SLB is failing
7. The SLB is good; the network must be dropping packets
8. Packets are reaching the SLB
9. The customer opens too many connections and exhausts the SNAT pool; the behavior is expected

[Figure: the ticket bounces among the SLB, network, and storage teams]

Page 44: New Challenges in Network Diagnosis

We studied 200 misrouted incidents in Azure to understand why multiple teams get involved.

Page 45: New Challenges in Network Diagnosis

1. Lack of domain knowledge

▷ The storage team doesn't know whether the network is functioning
▷ Team-level dependencies are hard to reason about

Page 46: New Challenges in Network Diagnosis

2. No cloud team is responsible, causing more misrouting

▷ An ISP or a customer outside the cloud is experiencing issues

Page 47: New Challenges in Network Diagnosis

3. Concurrent incidents


▷ One failure causes multiple incidents in multiple teams

Page 48: New Challenges in Network Diagnosis


How to reduce misrouting?

Page 49: New Challenges in Network Diagnosis

Existing solutions

Application-specific diagnosis systems — NetPoirot [SIGCOMM'16], DeepView [NSDI'18], Sherlock [SIGCOMM'07]: there are too many applications in the data center to build one for each

Natural language processing — NetSieve [NSDI'13]: ignores essential domain knowledge

Page 50: New Challenges in Network Diagnosis

Incident routing problem revisited

[Figure: an incident routing system takes an incident, domain knowledge, and monitor data, and decides among teams such as SLB, networking, and storage]

A well-defined algorithm? Hard! Machine learning?

Page 51: New Challenges in Network Diagnosis

Solve the whole problem at once?

▷ Hard to build a single, monolithic incident routing system
  ○ Curse of dimensionality: a huge feature vector without enough training examples; hard to find the appropriate feature set for each team
  ○ Uneven instrumentation: a subset of teams will always have gaps in monitoring
  ○ Limited visibility and constant change: stale components and monitors

Page 52: New Challenges in Network Diagnosis

Scout: team-specialized ML-assisted gatekeeper

▷ “Is my team responsible for the incident?”

○ One team, one Scout
○ Leverages domain knowledge
○ Evolves independently

Page 53: New Challenges in Network Diagnosis

Scout design

[Figure: a Scout takes an incident and monitor data, applies domain knowledge through computation engines, and outputs a classification result]

Page 54: New Challenges in Network Diagnosis

Physical networking team

Scope: every switch & router in the data center
Monitoring: 11 monitoring systems (PingMesh, Everflow, NetBouncer, etc.)

Statistics:
• 58% of incidents investigated by PhyNet went through multiple teams
• 97.6 hours per day wasted on unnecessary investigations

Page 55: New Challenges in Network Diagnosis

CHALLENGE 1: Millions of devices in the cloud

▷ How to process the huge amount of monitoring data?

Page 56: New Challenges in Network Diagnosis

Incident guided investigation

[Figure: incident description → regular expressions → devices → monitor data]

Example: "Server X.c10.dc3 is experiencing problems connecting to storage cluster c4.dc1"
→ Server: X.c10.dc3, Cluster: c4.dc1
→ Pull CPU usage, link loss rate, and ping latency for those components
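A minimal sketch of this step (the regular expressions and device-name format are illustrative, taken only from the example above, not from the production system):

    import re

    # Minimal sketch (illustrative patterns, not the production Scout):
    # pull device names out of an incident description so only the
    # relevant monitor data needs to be fetched.
    SERVER_RE  = re.compile(r"[Ss]erver\s+(\S+\.c\d+\.dc\d+)")
    CLUSTER_RE = re.compile(r"cluster\s+(c\d+\.dc\d+)")

    def extract_devices(description: str) -> dict:
        return {
            "servers":  SERVER_RE.findall(description),
            "clusters": CLUSTER_RE.findall(description),
        }

    incident = ("Server X.c10.dc3 is experiencing problems "
                "connecting to storage cluster c4.dc1")
    print(extract_devices(incident))
    # {'servers': ['X.c10.dc3'], 'clusters': ['c4.dc1']}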

Page 57: New Challenges in Network Diagnosis

CHALLENGE 2: How to create a feature vector out of the monitoring data?

▷ Different incidents involve a variable number of devices
▷ The monitoring data comes in mixed types (event-driven vs. time series)

Page 58: New Challenges in Network Diagnosis

How to build a fixed-width feature vector? (see the sketch below)

▷ Per-component features
  ○ Events: count the number of events during the incident period
  ○ Time series: normalize and compute statistics (percentiles, average, etc.)
▷ Across multiple components
  ○ Compute statistics over the per-component features (percentiles, average, etc.)
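A minimal sketch of that construction (NumPy-based, with assumed feature choices; not the production feature pipeline). It assumes every component exposes the same set of monitors:

    import numpy as np

    # Minimal sketch (assumed feature choices, not the production pipeline):
    # turn a variable number of components with mixed monitoring data into
    # one fixed-width feature vector.
    STATS = [np.mean, lambda x: np.percentile(x, 50),
             lambda x: np.percentile(x, 99), np.max]

    def component_features(events, time_series):
        """events: list of event timestamps in the incident window;
           time_series: dict monitor_name -> 1-D array of samples."""
        feats = [float(len(events))]              # event-driven data: a count
        for name in sorted(time_series):          # time series: summary stats
            v = np.asarray(time_series[name], dtype=float)
            v = (v - v.mean()) / (v.std() + 1e-9)  # normalize
            feats.extend(stat(v) for stat in STATS)
        return np.array(feats)

    def incident_features(components):
        """components: list of (events, time_series) for every device the
           incident touches; output width is fixed regardless of the count."""
        per_comp = np.stack([component_features(e, ts) for e, ts in components])
        # aggregate each per-component feature across components
        return np.concatenate([per_comp.mean(axis=0),
                               np.percentile(per_comp, 99, axis=0),
                               per_comp.max(axis=0)])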

Page 59: New Challenges in Network Diagnosis

CHALLENGE 3: Which computation engine?

Page 60: New Challenges in Network Diagnosis

Supervised learning: random forest

• Learns from historical incidents; high accuracy on incident types seen before
• Interpretable, able to provide more insights
• Low accuracy on new incidents
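A minimal sketch of this engine (scikit-learn, with a hypothetical data-loading helper; not the production training code):

    from sklearn.ensemble import RandomForestClassifier

    # Minimal sketch (hypothetical load_incident_dataset helper, not the
    # production training code): a per-team gatekeeper trained on history.
    # X: one fixed-width feature vector per incident (built as above);
    # y: 1 if the team (e.g., PhyNet) was responsible, else 0.
    X_train, y_train, X_new = load_incident_dataset()

    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced")
    clf.fit(X_train, y_train)

    # "Is my team responsible for this incident?"
    responsible = clf.predict(X_new)
    confidence = clf.predict_proba(X_new)[:, 1]

    # Feature importances make the verdict interpretable:
    # they indicate which monitors drove the decision.
    top_signals = clf.feature_importances_.argsort()[::-1][:10]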

Page 61: New Challenges in Network Diagnosis

Change point detection for new incidents


Page 62: New Challenges in Network Diagnosis

• Higher accuracy on new incidents
• Easy to compute
• Low accuracy on old incidents
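A minimal sketch of what such an engine might look like (a simple before/after mean-shift test per monitor; the actual detector may differ):

    import numpy as np

    # Minimal sketch (simple mean-shift test; the production detector may differ):
    # flag a team as suspicious if any monitored signal for its components
    # shows a clear change point around the time the incident was reported.
    def has_change_point(signal, incident_idx, window=60, threshold=3.0):
        """signal: 1-D array of samples; incident_idx: index of incident time."""
        before = np.asarray(signal[max(0, incident_idx - window):incident_idx], float)
        after = np.asarray(signal[incident_idx:incident_idx + window], float)
        if len(before) < 2 or len(after) < 1:
            return False
        shift = abs(after.mean() - before.mean())
        return shift > threshold * (before.std() + 1e-9)   # z-score-like test

    def team_is_suspect(signals, incident_idx):
        """signals: dict monitor_name -> 1-D array for the team's components."""
        return any(has_change_point(s, incident_idx) for s in signals.values())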

Page 63: New Challenges in Network Diagnosis

Model selector

▷ Use meta-learning to identify new incidents
▷ The incident itself tells whether it is new or not

Page 64: New Challenges in Network Diagnosis

The anatomy of a Scout

[Figure: the anatomy of a Scout — a configuration file describes the team; the incident and monitor data feed the computation engines (random forest and change point detection); a model selector picks which engine's verdict to use; the output is the classification result]
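Putting the pieces together, a minimal sketch of the selector logic (hypothetical helpers such as looks_new and change_point_engine; the real meta-learner is more involved):

    # Minimal sketch (hypothetical helpers; the real meta-learner is more involved):
    # route an incident to the engine that is likely to be accurate for it.
    def scout_verdict(incident, monitor_data, forest, change_point_engine, looks_new):
        # fixed-width vector, built as in the earlier sketch
        features = incident_features(monitor_data)
        if looks_new(incident):                        # meta-learner: unseen incident type?
            return change_point_engine(monitor_data)   # better on new incidents
        return bool(forest.predict([features])[0])     # better on historical incidents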

Page 65: New Challenges in Network Diagnosis


Evaluation

Page 66: New Challenges in Network Diagnosis

Evaluation setup

DATASET: 9 months of incidents in Azure, randomly split into training and test sets
LABEL: whether the incident was resolved by PhyNet
BASELINE: the current incident routing system without Scouts (runbooks, past experience, and an NLP-based routing system)

Page 67: New Challenges in Network Diagnosis

Overall performance

                 Precision   Recall   F1-score
PhyNet Scout       97.5%      97.7%     0.98
Baseline           87.2%      91.9%     0.89
Delta             +10.3%     +5.8%    +0.09

A 10% improvement in accuracy means a significant reduction in investigation time.

Page 68: New Challenges in Network Diagnosis

Benefit of the PhyNet Scout

Gain-in: incidents sent directly to PhyNet (instead of bouncing through SLB or storage first)

[Figure: CDF of the fraction of investigation time saved, compared with the best possible gain — the Scout saves more than 20% of the total investigation time in 40% of incidents]

Page 69: New Challenges in Network Diagnosis

Benefit of the PhyNet Scout

Gain-out: incidents correctly rejected by the PhyNet Scout (routed to storage or SLB instead)

[Figure: CDF of the fraction of investigation time saved, compared with the best possible gain — the Scout's gain is close to the best possible]

Page 70: New Challenges in Network Diagnosis

Conclusion

▷ Incident misrouting is the main challenge for maintaining service-level objectives in the cloud

▷ Scouts, distributed team-specialized gatekeepers, can reduce investigation time

Page 71: New Challenges in Network Diagnosis

This talk

▷ DETER: Record-and-Replay for TCP
  ○ Collect detailed yet lightweight packet information at scale

▷ Scouts: Domain-customized incident routing
  ○ Automatically direct incident tickets to the right team

Page 72: New Challenges in Network Diagnosis

Future Directions in Diagnosis

• New application trends
  – Ultra-low latency: the killer microseconds
  – Complex structure, especially with cross-layer designs
  – Diagnosis is increasingly important for performance optimization

• Open questions
  – How to collect fine-time-scale events at large scale?
  – How to tease apart causal relations across layers, components, and applications?
  – Data-driven approaches for diagnosis
  – Customized diagnosis tools for new applications