Monitoring and Troubleshooting
Peter Hoose, packet loss hater, Facebook
One Engineer’s Rant
Solutions, lessons learned and what not to do
Brendan Cleary, Lance Dryden, Francisco Hidalgo, Peter Hoose, Ernesto Ovcharenko, Petr Lapukhov, Jose Leitao, James Paussa, Jimmy Williams, Nathan Bronson
Microbursts
router1# show int eth18/1
Ethernet18/1 is up
admin state is up, Dedicated Interface
…
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec
…
Last link flapped 7week(s) 5day(s)
…
input rate 1.75 Gbps, 325.01 Kpps; output rate 2.10 Gbps, 371.97 Kpps
RX
…
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 0 CRC 0 no buffer
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
TX
…
0 jumbo packets
0 output error 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause
router1#
Microbursts
[root@localhost]# hping -S 192.168.0.100
HPING 192.168.0.100 (eth0 192.168.0.100): S set, 40 headers + 0 data bytes
len=46 ip=192.168.0.100 ttl=128 id=19314 sport=0 flags=RA seq=0 win=0 rtt=0.5 ms
len=46 ip=192.168.0.100 ttl=128 id=19316 sport=0 flags=RA seq=1 win=0 rtt=0.5 ms
len=46 ip=192.168.0.100 ttl=128 id=19317 sport=0 flags=RA seq=2 win=0 rtt=0.4 ms
--- 192.168.0.100 hping statistic ---
4 packets tramitted, 3 packets received, 25% packet loss
round-trip min/avg/max = 0.4/0.8/1.6 ms
[root@localhost]#
25% packet loss
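The interface counters on the router show zero drops, while the probe summary reports loss. As a minimal sketch of how the 25% figure follows from the probe counters, the Python below recomputes it from the summary line. The function names `loss_percent` and `parse_summary` are hypothetical helpers introduced here for illustration; they are not part of hping or of the talk.

```python
import re

# Matches the counters in an hping-style summary line. The verb after
# "packets" is matched loosely because its spelling varies between versions.
SUMMARY_RE = re.compile(r"(\d+) packets \w+, (\d+) packets received")


def parse_summary(line: str) -> tuple[int, int]:
    """Return (transmitted, received) parsed from a probe summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"no packet counters found in: {line!r}")
    return int(match.group(1)), int(match.group(2))


def loss_percent(transmitted: int, received: int) -> float:
    """Percentage of probes that never got a reply."""
    if transmitted <= 0:
        raise ValueError("transmitted must be positive")
    return 100.0 * (transmitted - received) / transmitted


if __name__ == "__main__":
    sent, got = parse_summary(
        "4 packets tramitted, 3 packets received, 25% packet loss"
    )
    print(loss_percent(sent, got))
```

With only four probes, a single missing reply already accounts for the whole 25%, so a longer run would be needed before treating the number as a loss rate.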
Microbursts
[Figure: histogram of throughput measurements. Axes: Throughput (mbps), in 50 mbps buckets from 0-50 to 950-1000, and Count of Measurements. Bar labels range from single digits to more than 150,000.]
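The chart groups individual throughput measurements into 50 mbps buckets and counts how many fall into each. A minimal sketch of that bucketing is shown here, assuming samples are already expressed in mbps and that the top of the range is 1000 mbps as on the chart axis. The names `bucket_label` and `histogram` and the sample values are hypothetical; the talk does not show how its measurements were collected.

```python
from collections import Counter
from typing import Iterable

BUCKET_MBPS = 50   # bucket width taken from the chart axis labels
MAX_MBPS = 1000    # top of the chart range (assumption for this sketch)


def bucket_label(mbps: float) -> str:
    """Map one throughput sample to a label such as '0-50' or '950-1000'."""
    if mbps < 0:
        raise ValueError("throughput cannot be negative")
    # Clamp so that a sample at or above MAX_MBPS lands in the last bucket.
    top_bucket_start = MAX_MBPS - BUCKET_MBPS
    low = min(int(mbps // BUCKET_MBPS) * BUCKET_MBPS, top_bucket_start)
    return f"{low}-{low + BUCKET_MBPS}"


def histogram(samples: Iterable[float]) -> Counter:
    """Count how many samples fall into each bucket."""
    return Counter(bucket_label(s) for s in samples)


if __name__ == "__main__":
    # Illustrative samples only: mostly low rates with one short burst.
    for label, count in histogram([12.0, 30.5, 48.0, 975.0]).items():
        print(label, count)
```

Counting short-interval samples this way keeps the rare high-rate buckets visible, whereas a single averaged rate like the one in `show int` would hide them.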
Microbursts - Lessons Learned
• Resolved issues
• Root Cause
• Software helps
• Service owner identified
• Resolution time
• Small loss, significant impact
Link Imbalance - Lessons Learned
• Resolved issues
• Root Cause
• Software helps
• Service owner identified
• Resolution time
• Small loss, significant impact
DIP      Loss  Latency
1.1.1.1  0.1   10
2.2.2.2  0     10
[Diagram: three groups labeled Servers]
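The table records loss and latency per destination IP (DIP). One way to turn such a table into a detection signal is to flag every destination whose loss exceeds a threshold, sketched below. The `ProbeResult` type, the `lossy_destinations` function, and the default threshold of zero are assumptions made for illustration; the slides give no units for the Loss and Latency columns and do not describe the actual detection logic.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    """One row of the per-destination table (units as on the slide, unstated)."""
    dip: str
    loss: float
    latency: float


def lossy_destinations(
    results: Iterable[ProbeResult], loss_threshold: float = 0.0
) -> list[str]:
    """Return the destination IPs whose measured loss exceeds the threshold."""
    return [r.dip for r in results if r.loss > loss_threshold]


if __name__ == "__main__":
    table = [
        ProbeResult(dip="1.1.1.1", loss=0.1, latency=10),
        ProbeResult(dip="2.2.2.2", loss=0, latency=10),
    ]
    print(lossy_destinations(table))
```

In this example both destinations report the same latency, so latency alone would not separate them; only the loss column singles out 1.1.1.1.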
Detection
loss effects on throughput
[Figure: Throughput (mbps), 0 to 1000, versus Packet Loss %, 0.00000% to 1.00000%, with series keyed by RTT. Callout: -50%.]
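The slides do not name a model for this curve. One widely cited approximation for Reno-style TCP is the Mathis et al. relation, throughput ≈ (MSS / RTT) · C / √p with C ≈ √(3/2), where p is the loss probability. The sketch below evaluates that relation as a rough illustration of why small loss rates matter more as RTT grows. It is not the method behind the plotted measurements, and the MSS of 1460 bytes, the 25 ms RTT, and the 1000 mbps cap are illustrative assumptions.

```python
import math

MATHIS_C = math.sqrt(3.0 / 2.0)  # constant from the periodic-loss Reno model


def mathis_throughput_mbps(
    mss_bytes: int, rtt_s: float, loss_prob: float, link_mbps: float = 1000.0
) -> float:
    """Estimate steady-state TCP throughput in mbps, capped at the link rate.

    loss_prob is a probability (0.01 means 1% loss), not a percentage.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if loss_prob <= 0:
        # With no loss the model is unbounded, so the link rate is the limit.
        return link_mbps
    bits_per_second = (mss_bytes * 8 / rtt_s) * MATHIS_C / math.sqrt(loss_prob)
    return min(bits_per_second / 1e6, link_mbps)


if __name__ == "__main__":
    for loss_pct in (0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0):
        estimate = mathis_throughput_mbps(1460, 0.025, loss_pct / 100.0)
        print(f"{loss_pct:.5f}% loss -> {estimate:.1f} mbps")
```

Under this model, quadrupling the loss rate halves the estimate, and doubling the RTT halves it as well. Other congestion control algorithms respond to loss differently, so measured curves can depart from it.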
Different algos?
[Figure: Throughput (mbps), 0 to 350, versus Packet Loss %, 0.00000% to 1.00000%, for Reno, Cubic, Vegas and Illinois. Callout: 4x.]
Recovery time
[Figure: Throughput (mbps), 0 to 350, versus Time (sec), 0 to 120, for Reno, Cubic, Vegas and Illinois, with a 1% Packet Loss marker. Callout: 14x.]