Effective Datacenter Troubleshooting Methodology –
A case study review
Jane Gao – Technical Leader, Services
CCIE Datacenter
Rahul Parameswaran – Customer Support Engineer, DCSW
CCIE Datacenter, R&S
BRKDCT-2408
Agenda
• Data Center Solution Overview
• Troubleshooting Basics
• Case Studies
• The Dos and the Don'ts
Data Center Solution Overview
Cisco Unified Data Center
Unified Fabric
Unified Management
Unified Computing
Automated Resource Management
• Simplify and automate IT provisioning
• Deliver physical and virtual resources on demand
Integrated, Smart Computing Infrastructure
• Unify computing, networking, storage access, and virtualization resources
• Simplify management and enhance flexibility
Highly Scalable, Secure Network Fabric
• Deliver architectural flexibility
• Provide consistent networking across physical, virtual, and cloud environments
When there's downtime in the data center… (Ponemon Institute, September 2013)
• Loss of employee productivity
• Loss of employee morale
• Loss of business opportunities
• Loss of revenue
• Loss of customer confidence
• Customer compensation costs
• Damaged corporate reputation
• Loss of partner trust/confidence
Troubleshooting Basics – Methodology
How We Troubleshoot
Understanding the problem:
• Knowledge based
• Strategy based
Example: 7 * 9 = 63 is answered instantly from memory (knowledge), while 24 * 41 = 984 has to be worked out step by step (strategy).
The Skill Pyramid
[Diagram: a pyramid pairing problems with skills – low-complexity problems are handled mostly with knowledge; as complexity grows, strategy plays an increasingly large role.]
The TAC Secret Ingredients
Troubleshoot:
1. Understand the problem
2. Break down the issue
3. Apply knowledge
4. Identify possible causes
5. Test the most probable cause
6. Confirm the root cause
"A problem well stated is a problem half solved." – Charles Kettering, inventor and head of research for GM
Step 1 - Understanding the Problem
The 5 Ws
• Who is experiencing the problem?
• Why is it important?
• What are the effects?
• When did the problem start?
• Where does the problem occur?
• What is NOT the problem?
The H
• How did the problem start, and what has changed?
Situation assessment
Problem definition
Problem 1 – Performance issue
(Initial description): Network slowness and packet drops, P1
When, what, where, who
(1st pass): Mission-critical applications are failing on multiple VLANs, and timeouts are seen with some applications. Users are reporting slowness and latency. There were no known changes, and things were working fine until this morning. N7Ks form the core, with 3rd-party devices at the access layer where the hosts are connected. (How about basic connectivity?)
How, who, where
(2nd pass): Does ping work? Pings from a test laptop to an application server see 180-240 ms latency; no drops are seen on the core switches (N7Ks).
Who else, what is not, where else, to what extent
(3rd pass): Pings from a test laptop on the same access switch as the application server, bypassing the core, still see around 240 ms latency
A clear and specific problem to troubleshoot – and the core may not be the problem
Problem 2 – OTV config assistance
(Initial description): Connectivity issue when SVIs are brought up at site B
When, what, where, why, what is not
(1st pass): VLAN 224 (vCenter) and VLAN 130 (hosts) are present at site A along with their SVIs. They were newly added to OTV and to site B. The Add Host wizard hangs when the SVIs for VLANs 224/130 are brought up at site B; with the SVIs down it works fine. This is blocking the new deployment, and the deadline is approaching. (How about basic connectivity?)
How, what, where, what is not
(2nd pass): Does ping work? Pings go through between the sites; however, Add Host still times out.
What are the differences
(3rd pass): Packet type and size. Pings with a packet size of 1430 bytes go through, but not 1431 bytes and above
Narrowed down to an MTU issue
Problem 3 – MAC flapping and packet drop
(Initial description): Network down due to connectivity issues with various application servers
When, what, where
(1st pass): Connectivity issues are reported across the FabricPath network. It has been working fine for over a year, with no known changes. MAC flapping is observed on the local switch between two links going to a pair of remote FabricPath N5K switches. One of the problematic servers is connected to the N5Ks via eVPC. (Where is the flapping really happening – is this a FabricPath issue?)
Where exactly, what, what is not
(2nd pass): Tracked down the flapping MAC address – it is flapping between eVPC links on the remote N5K pair going to a FEX
Narrowed down to port-channel rather than FabricPath
Step 2 - Troubleshoot
Break down the problem - simplify
• Network Topology
• Technology
Apply Knowledge & Experience
• How things should have worked
• What are the changes
Identify the possible causes
• Changes (known vs. unknown)
• Rule out
Test the most probable cause
• Explain the symptoms (Is and Is Not)
• Satisfy the conditions (What, When, Where, Extent)
• Use the most approachable tests
What do we need: some logic, some tools
Scenario 1 – packet drop
Hi,
We have 2 filer heads connected to Nexus switches on the same VLAN; the filers connect to the 2 Nexus via vPC. Filer head 1 (IP ip_A) sends 21 ping requests but receives only 17 replies. A capture on that filer confirms 21 requests went out and 4 got no response. A capture on the other filer head (IP ip_B) shows that 4 of the packets never reached filer head 2.
Scenario 1 – Topology
[Diagram: Filer 1 and Filer 2 are each dual-linked to the N5k01/N5k02 pair in vPC; 21 ping requests go out but only 17 responses come back.]
Scenario 1 – Narrowing Down
With vPC there are 4 different forwarding paths between the filers:
- Narrow down to 1 path
- Narrow down to 1 direction
- Narrow down to 1 link or device
Tools: PACL, sniffer (see the PACL sketch below)
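A minimal PACL sketch for counting the pings on one narrowed-down link (the addresses stand in for ip_A/ip_B and the interface is a placeholder):

ip access-list FILER_PING
  statistics per-entry
  10 permit icmp host 10.1.1.11 host 10.1.1.12
  20 permit ip any any
!
interface Ethernet1/1
  ip port access-group FILER_PING in

Comparing the per-entry match counters from 'show ip access-lists FILER_PING' at each hop of the chosen path shows exactly where the 4 missing packets disappear.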
Scenario 2 – OTV Config Assistance: "Narrowed down to MTU issue"
Topology: DC1 – ((( SP1 ))) – ((( SP2 ))) – … – DC2
               \– ((( SPx ))) – DC3 – ((( SPy ))) –/
- What is the traffic path for the failed ping? DC1 – DC2
- Is the issue with the service provider? Ping from the DC1 edge to the DC2 edge with a larger packet size and the df-bit set – the ping succeeds – so the issue is within your own network
- Ping from the OTV join interface toward the local DC edge and narrow down where the MTU is not set correctly
- If the MTU is set correctly but pings with larger packets still fail: is it the right interface? Or call TAC – the problem could be with hardware programming
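A sketch of the hop-by-hop MTU test described above (addresses and interface are placeholders):

ping 10.2.0.1 packet-size 1430 df-bit count 5    <-- succeeds
ping 10.2.0.1 packet-size 1431 df-bit count 5    <-- fails: this segment drops the larger frames
show interface ethernet 1/1 | include MTU        <-- verify the configured MTU on the suspect link

Repeating the pair of pings from the OTV join interface toward the DC edge, one segment at a time, isolates the link where the MTU is set too low.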
Scenario 3 – N5K forwarding issue
We are intermittently experiencing 6% packet loss to a server. Interface counters are clean but the loss is still occurring.
Scenario 3 – Topology
[Diagram: back-to-back double vPCs – host1 behind the N5k01/N5k02 pair, N7k01/N7k02 and N7k03/N7k04 in between, host2 on the far side. Sniffer captures were taken on N5k01 and on host1.]
Scenario 3 – Narrowing Down
Tools: PACL, sniffer
CLIs can be your best friend – "show port-channel load-balance forwarding-path" reveals which port-channel member link a given flow takes (see the sketch below).
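A sketch of the command for a hypothetical flow (port-channel number and IPs are placeholders; the available match parameters vary by platform and configured load-balance algorithm):

N5k01# show port-channel load-balance forwarding-path interface port-channel 10 src-ip 10.1.1.100 dst-ip 10.2.2.200

The output names the physical member link the flow hashes to, so the sniffer or PACL can be placed on exactly that link instead of on all four paths.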
Verifying the Root Cause
Test against the conditions:
• Does the probable cause match the problem description?
• Does the probable cause satisfy all of the conditions?
Test against the cause:
• Eliminate the probable cause: does the problem go away?
• Reproduce the same condition: does the problem come back?
The TAC Secret Ingredients
Troubleshoot:
1. Understand the problem
2. Break down the issue
3. Apply knowledge
4. Identify possible causes
5. Test the most probable cause
6. Confirm the root cause
From: Network slowness
To: Host A and host B are seeing slowness during file transfer
• Network topology
• L2 vs. L3
• Affected hosts/VLANs
• Software processing
• L2 instability
• Unicast flooding
• Faulty hardware
• Ping / Traceroute
• Working vs. non-working
• Tools: PACL, SPAN, etc.
• Forwarding path of the traffic
• L2 vs. L3
• ARP
• Difference between sites
Troubleshooting Basics – Toolkit
Sniffer Capture
• SPAN (Switched Port ANalyzer): a tool that captures traffic from a source and directs it to a destination interface
• Ethanalyzer
• TCPdump
• Wireshark
Pros: Commonly available on hosts, switches, appliances (firewalls, load balancers), etc.
Useful for intermittent packet drops and network performance type issues
Cons:
Can be time/resource consuming if a sniffer is not readily available
PACL
• Port Access List
IP access list TEST_ICMP
statistics per-entry
10 permit ip 1.1.1.1/32 100.100.100.100/32 [matches=10]
20 permit ip any any [matches=17642]
Pros: Commonly available on switches
Easy to use, quick to get results
Useful for intermittent packet drops and network performance type issues
Cons:
Requires configuration changes, which may not be possible on the fly in some deployments due to change control
Available in the ingress direction only
Ethanalyzer
• Built-in sniffer for CPU-bound traffic
• 'capture-filter' vs. 'display-filter'
• 'decode-internal'
• Other options
• Ethanalyzer does NOT:
• Capture data plane traffic forwarded in hardware
• Support interface-specific capture
• Ethanalyzer guides
• http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116136-trouble-ethanalyzer-nexus7000-00.html
• http://www.cisco.com/c/en/us/support/docs/switches/nexus-5000-series-switches/116201-technote-ethanalyzer-00.html
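A few hedged usage sketches (filters, counts, and the filename are illustrative): a capture-filter discards uninteresting frames at capture time, while a display-filter captures everything and filters only what is shown.

N7K# ethanalyzer local interface inband capture-filter "icmp" limit-captured-frames 100
N7K# ethanalyzer local interface inband display-filter "arp" limit-captured-frames 0
N7K# ethanalyzer local interface inband capture-filter "host 10.1.1.1" write bootflash:cpu.pcap

The last form saves the capture to a pcap file that can be opened in Wireshark.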
Logging Capabilities
• Persistent logging (Nexus 7000)
• Constant logging – event history
• Accounting log
• Commands:
• ‘show file logflash://sup-active//log/messages’
• ‘logging level <feature> <level>’
• ‘show log logfile’, ‘show log nvram’
• ‘show accounting log’
• ‘show system internal <feature> event-history’
• ‘show <feature> internal event-history’
Granular Show Commands and CLI Filtering
• Improved IOS-like CLI
• Feature specific show commands
• ‘show run’, ‘show run <feature>’ and ‘show run all’
• ‘show’ commands can be executed from exec or config mode
• Output piping ‘show xxx | ?’
• Well structured ‘show’ commands
• ‘show system internal’
• ‘show hardware internal’
• ‘show <feature> internal’
• Useful commands
• ‘hex’ / ‘dec’
• ‘diff’
• ‘show cli history [unformatted]’
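A few illustrative pipes and redirects (patterns and the filename are placeholders):

N7K# show ip arp | include 10.1.1.
N7K# show logging logfile | last 50
N7K# show interface counters errors | include Eth3/
N7K# show running-config interface ethernet 3/2 > bootflash:e3-2.cfg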
Granular Show Tech-support
• Capture show tech
• ‘show tech detail’
• ‘tac-pac’
• ‘show tech <feature>’
• ‘show tech all binary’ (6.2.x feature)
• Need-to-know
• Collect show tech details as soon as possible
• Redirect the outputs to files using ‘>’
• Appending to files with ‘>>’
• Capture feature show tech in addition to show tech detail
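A hedged sketch of collecting and redirecting show techs (filenames are placeholders):

N7K# show tech-support detail > bootflash:showtech-detail.txt
N7K# show tech-support vpc >> bootflash:showtech-detail.txt
N7K# tac-pac bootflash:tac-pac.gz

'tac-pac' writes a compressed show tech to flash, which keeps collection fast and the file small for upload to TAC.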
EEM (Embedded Event Manager)
• A subsystem to automate tasks and customize the device behavior
• Event
• Notification
• Action
• Many built-in system policies: ‘show event manager system-policy’
• Event notification action
• Helpful in data gathering when the occurrence of the issue is unpredictable
ELAM (Embedded Logic Analyzer Module)
• A tool to capture a packet and determine its forwarding path within the switch
• Powerful and flexible triggering capability
• Module specific
• Available on Nexus 7000 and Nexus 6000
• Need-to-knows
• L2-4 data plane forwarding issues
• Consistent problem
• Not a replacement for capture utilities like Ethanalyzer or SPAN
• ELAM guides
• http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116648-technote-product-00.html
• http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116647-technote-product-00.html
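A rough sketch of the flow from the linked guides (this loosely follows the F2/"flanker" guide; the ASIC name, instance number, prompt, and trigger syntax are all module specific, so treat every line as an assumption to verify against the guide for your line card):

N7K# attach module 3
module-3# elam asic flanker instance 1
module-3(fln-l2-elam)# trigger dbus ipv4 if source-ipv4-address 10.1.1.1 destination-ipv4-address 10.2.2.2
module-3(fln-l2-elam)# trigger rbus ingress if trig
module-3(fln-l2-elam)# start
module-3(fln-l2-elam)# status
module-3(fln-l2-elam)# show dbus
module-3(fln-l2-elam)# show rbus

The DBUS result shows the packet as it entered the ASIC; the RBUS result shows the forwarding decision, which is how the in-switch forwarding path is confirmed.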
NX-OS Tools Summary
• Granular show commands and CLI filtering
• Logging capabilities
• Granular show tech-support
• GOLD (Generic Online Diagnostics)
• OBFL (On-Board Failure Logging)
• Ethanalyzer (built-in "CPU sniffer")
• ELAM
• EEM (Embedded Event Manager)
• SPAN
• Debugs (with filters & redirection) and debug plugins
• Programmability
[Diagram: the tools grouped by purpose – info collection, hardware, troubleshooting.]
OBFL (On-Board Failure Logging)
• Persistent logging
• 32MB onboard flash
• Logs various events, for example:
• Reset reason
• Statistics history
• Kernel trace
• Others
• Command
• ‘show logging onboard mod <x>’
GOLD (Generic Online Diagnostics)
• A diagnostic framework that runs while the system is operational
• Corrective actions are taken through Embedded Event Manager (EEM) policies
• Tests run on both supervisors and line cards
• Test types:
• Bootup
• Health monitoring
• On-demand
• Scheduled
• Commands:
• ‘show diagnostic content’
• ‘show diagnostic result’
• ‘show diagnostic ?’
Debugs
• When event-history is not sufficient
• Use a debug logfile: ‘debug logfile <file>’
• Use debug-filter
• Debug-filter
• More granular debugs
• Can apply multiple filters simultaneously
• Commands
‘debug-filter pktmgr interface e1/1’
‘debug-filter pktmgr dest-mac 0100.5e00.000D’
‘show debug-filter all’
Programmability
• Adds control blocks in the CLI execution
• NX-API
• Python
– Cli(), Clid(), Clip()
– Interactive mode
– Noninteractive mode
• TCL
– Tcl 8.5, NX-OS 5.1(1)
– ‘tclsh bootflash:example.tcl’
• Search for “python API” on cisco.com
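A minimal on-box sketch (assumes the NX-OS Python 'cli' module; function names, casing, and JSON support vary by release):

# Run a show command and work with both raw and structured output
from cli import cli, clid
import json

print(cli('show clock'))                           # raw text output
brief = json.loads(clid('show interface brief'))   # structured (JSON) output where supported
print(brief)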
Case Studies
Case Study – Issue 1
Problem As Described
We recently migrated from a Catalyst 6500 to a pair of Nexus switches in vPC.
All of our access switches connect to the pair. As soon as we cut over to the Nexus, we see performance issues across the data center. Phones stop working and a few servers are not reachable. Everything was working fine before the cutover.
[Diagram: users and an FTP server connected behind the Nexus pair that replaced the Cat 6500.]
Multiple problems are described here; it's best to handle one problem at a time:
1a) Slowness across the data center
1b) Phones are not working
1c) Some servers are not reachable
Issue 1a – Slowness Across Data Center
Asking the right questions
Slowness across Data Center
What application(s) specifically do users have problems accessing? (Say FTP)
Q) Is the user able to start the FTP process? – YES
Q) Define slow – When a file is downloaded it takes 30 minutes instead of the earlier 3-5 minutes
Q) Are all users affected? Not sure; all affected users are on VLAN 50.
Q) What do you observe when a user is moved to a different VLAN? – Works fine
Q) Are all users in VLAN 50 connected to the same access layer switch? – NO – it does not matter where the user is connected; if the machine is on VLAN 50, FTP is slow. Look for the common point in the network the traffic traverses – the Nexus pair
We have narrowed down the issue to the Nexus pair – let's dig deeper.
What to look for
Interface drops – Given the problem is specific to VLAN 50, drops on the port are unlikely to be the cause, as traffic from other VLANs traverses the same physical links without issue
Spanning tree – Use CLI commands, e.g. "show spanning-tree detail", to look for excessive Topology Change Notifications (TCNs), as too many TCNs result in flooding
Forwarding path – See if the traffic is getting software switched for any reason; we can use a tool called Ethanalyzer
Ethanalyzer – See if Packets are CPU switched
Ethanalyzer is a tool to monitor traffic to and from the CPU
N7k#ethanalyzer local interface inband capture-filter tcp
2014-01-28 17:19:55.730066 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#20] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
2014-01-28 17:19:55.730193 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#21] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
2014-01-28 17:19:55.730340 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#22] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
Aha!
Traffic getting software switched
We have narrowed down the slowness to be a result of packets getting software switched.
What can cause traffic to go to CPU
• Packets with IP options
• Packets longer than the MTU, which must be fragmented
• IP redirects – when the next hop is reachable on the same VLAN the packet came in on – check the IP route for the source and destination IPs ('no ip redirects' under the VLAN interface; see the sketch below)
• Hardware mis-programming – call TAC
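If redirects turn out to be the trigger, a minimal sketch of the fix (the SVI number is illustrative):

interface Vlan50
  no ip redirects
! stops the supervisor from punting packets in order to generate ICMP redirects on this SVI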
Issue 1b – Phones Not Working
[Diagram: the same topology, with a DHCP server behind the Nexus pair.]
Asking the right questions
Q) Are all phones down? – No, only phones in VLAN 10
Q) Are they rebooting? What phase of bootup is failing? Unable to get an IP address
Q) Packet capture on the DHCP server – no Discovers from these specific phones are seen
Q) Are all phones in VLAN 10 connected to one switch? – No, it is across the network
Start looking from the CORE – the Nexus pair
Asking the right questions
Q) What is different about VLAN 10? Nothing different
Q) What VLAN is the DHCP server in? 3 servers: one in VLAN 10 and others in VLAN 20 and 30, but the one in VLAN 10 is the only active DHCP server at this point.
DHCP server in the same VLAN as the phones – good data point
Q) No Discovers from these specific phones are seen on the server; do they reach the Nexus switches? Let's use SPAN/Ethanalyzer to confirm
Ethanalyzer – See if DHCP relay works
sw1(config-if)# ethanalyzer local interface inband capture-filter "port 67" limit-captured-frames 0
Capturing on inband
2014-05-01 06:20:41.793378 0.0.0.0 -> 255.255.255.255 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.5.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.6.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
The Nexus switch is relaying the packets only to the servers in VLAN 20 and 30, which are not active.
We assumed the DISCOVER, being a broadcast packet, would make it to the server on the same VLAN.
NOT TRUE.
On Nexus, when relay is used, we need to specify the DHCP server even when it is on the same VLAN (see the sketch below).
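A minimal sketch of the relay configuration (the server IP is a placeholder; note the same-VLAN server must be listed too):

feature dhcp
!
interface Vlan10
  ip dhcp relay address 10.10.10.5
! the active server sits in VLAN 10 itself - without this line its Discovers are never relayed to it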
Issue 1c – Some Servers Are Not Reachable
[Diagram: servers connected behind the Nexus pair that replaced the Cat 6500.]
Asking the right questions
Q) Are you able to ping the server from the Nexus? – YES
Q) Are you able to ping the server from the Nexus sourcing a different VLAN IP? – NO
Something must be wrong with the default gateway setting – but why did it work with the Cat6500 then?
Let's see what the packets look like using Ethanalyzer while pinging from the Nexus
Ethanalyzer – See what traffic is received
N7k# ethanalyzer local interface inband capture-filter "arp or host 172.16.3.41"
2014-03-20 15:15:42.996037 00:0f:bb:18:c5:70 -> ff:ff:ff:ff:ff:ff ARP Who has 192.168.1.1? Tell 172.16.3.41
This does not look right: the end server is ARPing for the destination instead of its gateway.
We need the Nexus to proxy-reply to the ARP.
It worked with the Cat6500 because proxy ARP is on by default there. On Nexus platforms, proxy ARP is disabled by default (see the sketch below).
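The clean fix is to correct the server's gateway/netmask, but to restore the Catalyst behavior a minimal sketch is (the SVI number is illustrative):

interface Vlan100
  ip proxy-arp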
Conclusion
How did we solve these issues?
1) Taking them one issue at a time
2) Asking the right questions to narrow down the scope of the problem
3) Using logical reasoning and the right tools to root-cause the problem
Case Study – Issue 2
Problem As Described
At 1AM every day there is a network outage. Servers lose access to the network.
Asking the right questions
Are there any scheduled activities that take place at 1AM every night? None we are aware of
Is the whole network affected? Yes, servers across multiple segments are affected
What is done to restore connectivity? Nothing, it recovers by itself
How long does it last? 2-3 minutes
Where do we start
Since the entire network is affected, start from the CORE
Check logs for any activities at 1AM for the last few days
Check for any STP events or routing changes at that time
Logs
show logging log:
2013 Apr 7 01:00:03 Nexus %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 10, VPC peer keep-alive receive has failed
2013 Apr 7 01:00:22 Nexus %STP-2-DISPUTE_DETECTED: Dispute detected on port Ethernet10/13 on VLAN1616
2013 Apr 7 01:00:30 Nexus %STP-2-DISPUTE_CLEARED: Dispute resolved for port Ethernet10/13 on VLAN1616
2013 Apr 7 01:02:36 Nexus %BGP-3-NOTIFICATION: sent to neighbor 10.137.29.9 4/0 (hold time expired) 0 bytes
show accounting log:
Sun Apr 7 01:00:01 2013:type=start:id=171.69.89.32@pts/0:user=prime:cmd=
Sun Apr 7 01:00:03 2013:type=update:id=171.69.89.32@pts/0:user=prime:cmd=show access-list
CPU History
Multiple control plane protocols were affected – most likely the CPU spiked
'show processes cpu history'
How do we find out which process spiked?
EEM script
event manager applet HIGH-CPU
  event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 get-type exact entry-op ge entry-val 90 exit-val 50 poll-interval 5
  action 1.0 syslog msg High CPU hit $_event_pub_time
  action 2.0 cli enable
  action 3.0 cli show clock >> bootflash:high-cpu.txt
  action 4.0 cli show processes cpu sort >> bootflash:high-cpu.txt
Output of EEM Script
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4857 5986662 12668354 472 45.7% aclmgr
7069 770167668 689087915 1117 11.5% statsclient
4720 516729732 166565721 3102 6.7% oc_usd
5534 216 41 5269 6.7% pim
4915 262 702 374 5.8% netstack
5485 899469787 2147483647 236 3.8% stp
4667 108772307 105958793 1026 2.9% R2D2_usd
ACLMGR is the process invoked when ACL configuration is polled or changed
What was found
A tool was polling the ACL configuration from the device, which has over 32k ACEs
This caused ACLMGR to spike, affecting other protocols
The behavior was root-caused to a bug that was fixed in a newer code release
Case Study – Issue 3
Problem As Described
The network was severely degraded for several hours. It eventually recovered by itself; we need to understand what caused the issue – an RCA (root cause analysis) is requested.
Asking the right questions
Q) What specific applications were affected? All applications
Q) Were users unable to connect at all or was it slow? Both
Q) What changes were made, if any? No known changes
Q) What was done to fix this? Nothing known
Q) What time did the issue start? How long did it last? Issue started at around 1PM and stabilized at around 11:30PM
What to check? Where to start?
Network-wide events – check for spanning tree events, routing loops, broadcast storms
Starting from the core, check all switches for interface drops and error logs
Inspect CoPP for any drops on the core switches
Check monitoring tools for alerts on high link utilization, high CPU spikes, failures, etc.
Checking Spanning Tree
Nexus# show spanning-tree detail
VLAN0100 is executing the rstp compatible Spanning Tree protocol
Bridge Identifier has priority 0, sysid 1, address c84c.75fa.6000
Configured hello time 2, max age 20, forward delay 15
We are the root of the spanning tree
Topology change flag not set, detected flag not set
Number of topology changes 6 last change occurred 160:10:57 ago
from port-channel20
Times: hold 1, topology change 35, notification 2
hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0
>>> Only 6 topology changes, the last one long ago – spanning tree looks stable
Checking interface drops
Nexus# show interface counter error
--------------------------------------------------------------------------------
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth3/1 0 0 0 0 0 0
Eth3/2 0 0 0 0 0 105430
Eth3/3 0 0 0 0 0 105421
Eth3/4 0 0 0 0 0 105794
>> Similar output drop (OutDiscards) counters seen on multiple ports
Checking Control Plane Policing Drops
Nexus# show policy-map interface control-plane
<snip>
class-map copp-system-p-class-normal (match-any)
<snip>
match protocol arp
set cos 1
police cir 680 kbps bc 250 ms
conform action: transmit
violate action: drop
module 3:
conformed 4582560313 bytes,
5-min offered rate 3452 bytes/sec
violated 37822500313 bytes,
5-min violate rate 0 bytes/sec
Huge violations in the ARP class – either a loop in the network or a misbehaving device flooding the network with ARP traffic
Check SNMP server for link utilization graph
What have we learnt so far?
Clearly there is a broadcast storm in the network
The storm consists of ARP packets
The ARP storm caused congestion and could have ended up dropping valid ARP packets
WHAT CAUSED IT / WHERE DID IT ORIGINATE?
How do we identify what caused the storm?
Track the origin using a simple "show interface counters"
Start from the CORE and inspect which interface has a very high InBcastPkts count
--------------------------------------------------------------------------------
Port InMcastPkts InBcastPkts
--------------------------------------------------------------------------------
<snip>
Eth3/10 336 15315895309
How do we identify what caused the storm?
Track the origin using a simple "show interface counters"
Track it down to the access layer switch to determine where this could have come from
--------------------------------------------------------------------------------
Port InMcastPkts InBcastPkts
--------------------------------------------------------------------------------
<snip>
Eth105/1/11 336 1531589530
Eth105/1/12 336 1527225940
<snip>
Eth107/1/3 337 1544275579
Conclusion
All of this traffic was a storm originating from a few servers
A new application had been deployed on those servers
The server team reverted the changes at around 11PM (and did not tell anyone)
What could have prevented this?
Designing the network using best practices such as storm control (see the sketch below)
Maintaining a log of all changes made so that it can be referenced when needed
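A minimal storm-control sketch (the interface and threshold are illustrative; the level is a percentage of interface bandwidth):

interface Ethernet105/1/11
  storm-control broadcast level 5.00

With this in place, a misbehaving server can only consume the configured share of its link with broadcast traffic instead of flooding the whole fabric.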
Issue 4 – Intermittent Connectivity Loss to Servers
[Diagram: users reach the servers across an MPLS network (PE and P routers); user traffic sits in the default VRF, the servers in a dedicated server VRF.]
Asking the right questions
Q) How often do users lose connectivity to the servers? Randomly, no pattern
Q) Is connectivity lost to all servers? No, a few servers are reachable and a few are not
Q) Do all users lose connectivity to a specific server? Yes
Q) What do you do to fix it? Nothing, it resolves by itself
Q) Is the server reachable from its default gateway? When we ping the server from its gateway, the first few pings fail, after which the server is reachable from its gateway and almost immediately users are able to connect
Inspect whether ARP glean works
Ethanalyzer – check if packets are punted to the CPU for it to generate an ARP request
N7k# ethanalyzer local interface inband capture-filter "arp or host 172.25.3.41"
Nothing is seen in Ethanalyzer – ARP never completes because the ARP request is never generated. This is why users lose access to the servers.
Hardware Rate Limiter
Hardware rate limiters are in place to protect the CPU (like CoPP)
N7k# show hardware rate-limiter layer-3 glean
Units for Config: packets per second
Allowed, Dropped & Total: aggregated since last clear counters
Rate Limiter Class Parameters
------------------------------------------------------------
layer-3 glean Config : 100
Allowed : 10146910
Dropped : 4636432 >> Increasing at a rapid rate
Total : 14783342
Conclusion
By asking the right questions we were able to rule out the complex MPLS network as the cause.
With the knowledge of how ARP glean comes into play, we narrowed down the issue to excess glean traffic causing the hardware rate limiter to kick in.
Glean throttling was implemented to fix this issue (see the sketch below).
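A hedged sketch of the Nexus 7000 glean-throttling knobs (the values are illustrative):

hardware ip glean throttle
hardware ip glean throttle maximum 1000
hardware ip glean throttle timeout 300

Glean throttling installs a temporary drop adjacency for an unresolved next hop, so repeated packets to the same unresolved host stop hammering the glean rate limiter while ARP resolution completes.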
The Dos and the Don'ts
The Dos
• Understand how things should work (on your network)
• Identify the broken scenario – define “broken” and/or “not working”
• Determine possible triggers, patterns, time frame
• Use solid troubleshooting techniques, start with basics
• Capture valuable information
• Ask the right questions
The Dos - Continued
• Stay calm
• Bring all relevant parties to the table
• Backup
• Documentation (network topology, traffic flow, IP addressing, etc.)
• Network Management
The Don'ts
• Jump to conclusions – most of the time it's not a bug
• Take drastic measures prematurely
• 'let's bounce the datacenter'
• 'we are reloading the switches one at a time'
• Lump all issues together
• Make multiple changes at once
• Mix status updates and technical troubleshooting in one call
The TAC Secret Ingredients
Troubleshoot:
1. Understand the problem
2. Break down the issue
3. Apply knowledge
4. Identify possible causes
5. Test the most probable cause
6. Confirm the root cause
Complete Your Online Session Evaluation
Don’t forget: Cisco Live sessions will be available for viewing on-demand after the event at CiscoLive.com/Online
• Give us your feedback to be entered into a Daily Survey Drawing. A daily winner will receive a $750 Amazon gift card.
• Complete your session surveys through the Cisco Live mobile app or your computer on Cisco Live Connect.
Continue Your Education
• Demos in the Cisco campus
• Walk-in Self-Paced Labs
• Table Topics
• Meet the Engineer 1:1 meetings
• Related sessions
Q&A
Thank you