![Page 1: NetPoirot: Taking the blame game out of data center …conferences.sigcomm.org/sigcomm/2016/files/program/...NetPoirot: Taking The Blame Game Out of Data Center Operations Behnaz Arzani,](https://reader033.vdocuments.mx/reader033/viewer/2022060917/60aa1c41b01dda306a27bdeb/html5/thumbnails/1.jpg)
NetPoirot: Taking The Blame Game Out of Data Center Operations
Behnaz Arzani, Selim Ciraci, Boon Thau Loo,
Assaf Schuster, Geoff Outhred
Datacenters can fail …
Failures are disruptive
Why is debugging hard?
[Figure labels: Penn researcher, Azure VM, Azure Network, Service X, Network]
In the case of a failure…
[Figure labels: Network, Network; "Someone accepts responsibility" vs. "Each blames the other"]
A real example… Event X
Current tools are insufficient
• Sherlock (SIGCOMM-07)
• NetMedic (SIGCOMM-09)
• NSDI-11
• TRat (SIGCOMM-02)
• Netprofiler (P2Psys-05)
Can we do better? (Overview)
• Introducing…
NetPoirot
Fault injector
Learning Agent
The monitoring agent
What is the TCP event digest?
Why do we think this can work?
To distinguish failures…
Decision trees…
His uncertainty is X
Decision trees…
His uncertainty is X - Y
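The drop from X to X - Y can be made concrete with a small sketch. The slides do not say which uncertainty measure is used; Shannon entropy, a common choice in decision tree learners, is assumed here. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, answers):
    """Entropy before a question minus the weighted entropy after it.

    `answers` holds the answer to one question for each labeled sample.
    """
    before = entropy(labels)
    after = 0.0
    for answer in set(answers):
        subset = [lab for lab, ans in zip(labels, answers) if ans == answer]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after


# Toy data (assumed): four failures and the answer to one yes/no question.
labels = ["network", "network", "client", "client"]
answers = [True, True, False, False]

x = entropy(labels)                    # uncertainty before asking (X)
y = information_gain(labels, answers)  # uncertainty removed by the question (Y)
remaining = x - y                      # uncertainty left afterwards (X - Y)
```

Under this assumption, a tree learner would pick the question with the largest Y at each node.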
Decision trees alone are not enough
[Figure: scatter plot with axes Feature 1 and Feature 2; regions labeled "Easiest to classify" and "Hardest to classify"]
What we do to deal with this
[Figure: scatter plot with axes Feature 1 and Feature 2]
Upper portion of an example tree…
Mean of max congestion window
Min of the last congestion window
50th percentile of number of triple duplicate ACKs
50th percentile of connection duration
Max of the number of triple duplicate ACKs
95th percentile of the max congestion window
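The node labels above are aggregate statistics (mean, min, max, percentiles) taken over per-connection TCP metrics. A minimal sketch of how such features could be computed follows. The metric names, the nearest-rank percentile definition, and the sample values are assumptions made for illustration, not details taken from the slides.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(connections, metric):
    """Summarize one per-connection metric across all connections."""
    values = [conn[metric] for conn in connections]
    return {
        f"mean_{metric}": statistics.mean(values),
        f"min_{metric}": min(values),
        f"max_{metric}": max(values),
        f"p50_{metric}": percentile(values, 50),
        f"p95_{metric}": percentile(values, 95),
    }


# Hypothetical per-connection measurements.
connections = [
    {"max_cwnd": 10, "triple_dup_acks": 0},
    {"max_cwnd": 20, "triple_dup_acks": 1},
    {"max_cwnd": 30, "triple_dup_acks": 0},
    {"max_cwnd": 40, "triple_dup_acks": 3},
]

features = {}
for metric in ("max_cwnd", "triple_dup_acks"):
    features.update(aggregate(connections, metric))
```

Each key in `features` then corresponds to one candidate split in a tree, for example `mean_max_cwnd` for "Mean of max congestion window".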
What we do to deal with this
[Figure: scatter plot with axes Feature 1 and Feature 2]
Upper portion of an example tree…
50th percentile of the max RTT
Number of flows
50th percentile of amount of data received
95th percentile of the number of timeouts
Decision trees alone are not enough
[Figure: scatter plot with axes Feature 1 and Feature 2]
The upper portion of an example tree…
Mean time spent in zero window probing
95th percentile of the ratio of number of bytes posted to received
Number of flows
Number of flows
95th percentile of connection durations
Minimum of the number of bytes received
Is it a network failure?
Is it a server problem?
Is it a client-side problem?
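One way to read the three questions is as a chain of binary classifiers, each answering a single yes/no question. The sketch below illustrates that reading only. The ordering, the "normal" fallback, the feature names, and the threshold rules standing in for trained models are all assumptions.

```python
def diagnose(features, is_network, is_server, is_client):
    """Ask the three questions in order; return the first label answered yes."""
    stages = (
        ("network", is_network),
        ("server", is_server),
        ("client", is_client),
    )
    for label, model in stages:
        if model(features):
            return label
    return "normal"  # assumed fallback when every question is answered no


# Hypothetical stand-ins for trained models; the thresholds are made up.
def is_network(f):
    return f["p95_timeouts"] > 2


def is_server(f):
    return f["mean_zero_window_probe_time"] > 0.5


def is_client(f):
    return f["p50_bytes_received"] < 100


lossy = {"p95_timeouts": 5, "mean_zero_window_probe_time": 0.0, "p50_bytes_received": 5000}
healthy = {"p95_timeouts": 0, "mean_zero_window_probe_time": 0.0, "p50_bytes_received": 5000}

lossy_label = diagnose(lossy, is_network, is_server, is_client)
healthy_label = diagnose(healthy, is_network, is_server, is_client)
```

Keeping each stage binary lets every model specialize in separating one class from the rest.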
Other details
If throughput < x: Open more connections
If throughput < x: Send more data on the same connection
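The two rules can be read as two applications reacting differently to the same throughput drop, which would leave different traces in features such as the number of flows or the bytes sent per connection. The sketch below follows that reading. The `AppState` type, the function names, and the doubling of data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    connections: int = 1
    bytes_per_connection: int = 1000


def react_by_opening_connections(state, throughput, x):
    """Hypothetical application A: open one more connection when throughput < x."""
    if throughput < x:
        return AppState(state.connections + 1, state.bytes_per_connection)
    return state


def react_by_sending_more(state, throughput, x):
    """Hypothetical application B: send more data on the same connection when throughput < x."""
    if throughput < x:
        return AppState(state.connections, state.bytes_per_connection * 2)
    return state


start = AppState()
after_a = react_by_opening_connections(start, throughput=5, x=10)
after_b = react_by_sending_more(start, throughput=5, x=10)
```

The same failure changes the flow count for application A but the per-connection volume for application B.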
What did we learn from all this?
Evaluation
How did we get labeled data?
Worst-case application
What if we haven’t seen the failure before?
Performance on real applications
| General label | Normal | Client | Network |
| --- | --- | --- | --- |
| Precision | 97.78% | 99.7% | 100% |
| Recall | 99.68% | 98.25% | 99.37% |
YouTube
Event X
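For reference, the precision and recall reported on this slide follow the standard per-label definitions, sketched below. The counts in the example are made up for illustration and are not taken from the evaluation.

```python
def precision(true_pos, false_pos):
    """Fraction of samples predicted as a label that truly carry it."""
    predicted = true_pos + false_pos
    return true_pos / predicted if predicted else 0.0


def recall(true_pos, false_neg):
    """Fraction of samples truly carrying a label that were predicted as it."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0


# Hypothetical counts for one label.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)  # 90 / (90 + 10)
r = recall(tp, fn)     # 90 / (90 + 30)
```

High precision for a label means few false accusations against that entity; high recall means few of its failures are missed.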
Things we did not talk about
What’s next?
Conclusion