Post on 20-Apr-2018


Page 1: Effective Datacenter Troubleshooting Methodology · Effective Datacenter Troubleshooting Methodology ... Secure Network Fabric ... (built-in “CPU sniffer”) • ELAM • EEM (Embedded

Effective Datacenter Troubleshooting Methodology –

A case study review

Jane Gao – Technical Leader, Services

CCIE Datacenter

Rahul Parameswaran – Customer Support Engineer, DCSW

CCIE Datacenter, R&S

BRKDCT-2408


Agenda

• Data Center Solution Overview

• Troubleshooting Basics

• Case Studies

• The Dos and the Don’ts


Data Center Solution Overview


Cisco Unified Data Center

Unified Management: Automated Resource Management

• Simplify and automate IT provisioning

• Deliver physical and virtual resources on demand

Unified Computing: Integrated, Smart Computing Infrastructure

• Unify computing, networking, storage access, and virtualization resources

• Simplify management and enhance flexibility

Unified Fabric: Highly Scalable, Secure Network Fabric

• Deliver architectural flexibility

• Provide consistent networking across physical, virtual, and cloud environments


Ponemon Institute September 2013


When there’s downtime in the datacenter…

• Loss of employee productivity

• Loss of employee morale

• Loss of business opportunities

• Loss of revenue

• Loss of customer confidence

• Loss of customer compensation

• Damaged corporate reputation

• Loss of partner trust/confidence


Troubleshooting Basics – Methodology


How We Troubleshoot

Understanding the problem

• Knowledge based: 7 * 9 = 63

• Strategy based: 24 * 41 = 984


The Skill Pyramid

[Figure: two pyramids with Knowledge and Strategy layers, labeled for problems ranging from low complexity to high complexity]


The TAC Secret Ingredients

1) Understand the problem

2) Troubleshoot

• Break down the issue

• Apply knowledge

• Identify possible causes

• Test the most probable cause

3) Confirm the root cause


Charles Kettering, inventor and head of research for GM


Step 1 - Understanding the Problem

The 5 Ws

• Who is experiencing the problem

• Why is it important

• What are the effects

• When did the problem start

• Where does the problem occur

• What is NOT the problem

The H

• How did the problem start, what has changed

Situation assessment

Problem definition


Problem 1 – Performance issue

(Initial description): Network slowness and packet drops, P1

When, what, where, who

(1st pass): Mission-critical applications are failing on multiple VLANs, and timeouts are seen with some applications. Users are reporting slowness and latency. There were no known changes beforehand, and things were working fine until this morning. N7Ks are the core, and 3rd-party devices form the access layer where the hosts are connected. (How about basic connectivity?)

How, who, where

(2nd pass): Does ping work? Pings from a test laptop to an application server see 180-240 ms latency; no drops are seen on the core switches (N7Ks).

Who else, what is not, where else, to what extent

(3rd pass): Pings from a test laptop on the same access switch as the application server, bypassing the core, experience latency of around 240 ms.

Result: a clear and specific problem to troubleshoot, and the core may not be the problem


Problem 2 – OTV config assistance

(Initial description): Connectivity issue when SVIs are brought up at site B

When, what, where, why, what is not

(1st pass): VLAN 224 (vCenter) and VLAN 130 (hosts) are present at site A along with their SVIs. They are newly added to OTV as well as at site B. The Add Host Wizard hangs when the SVIs for VLANs 224/130 are brought up at site B. If the SVIs are down, it works fine. This is blocking the new deployment, and the deadline is approaching. (How about basic connectivity?)

How, what, where, what is not

(2nd pass): Does ping work? Ping goes through between the sites; however, adding a host still times out.

What are the differences

(3rd pass): Packet types and sizes. Ping with packet size 1430 goes through, but not with 1431 bytes and above

Narrowed down to an MTU issue
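The size sweep in the 3rd pass (1430 passes, 1431 fails) can be scripted as a binary search. The following is a minimal Python sketch under stated assumptions: `probe` is a hypothetical callable that returns True when a ping of the given size succeeds with the DF bit set, and here it is simulated rather than wired to a real ping.

```python
from typing import Callable


def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes probe is monotonic: every size up to some limit passes and
    every larger size fails. Returns low - 1 if no size passes.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid          # mid passes, look for something larger
            low = mid + 1
        else:
            high = mid - 1      # mid fails, look for something smaller
    return best


def simulated_probe(size: int) -> bool:
    """Stand-in for a real DF-bit ping: sizes up to 1430 pass."""
    return size <= 1430


if __name__ == "__main__":
    print(largest_passing_size(simulated_probe, 64, 9000))
```

A search like this needs only about a dozen probes to cover the range 64 to 9000, which matters when each ping has to time out before it counts as a failure.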


Problem 3 – MAC flapping and packet drop

(Initial description): Network down due to connectivity issues with various application servers

When, what, where

(1st pass): A connectivity issue is reported across the FabricPath network. It has been working fine for over a year. There have been no known changes. MAC flapping is observed on the local switch between two links going to a pair of remote FP N5K switches. One of the problematic servers is connected to the N5Ks via eVPC. (Where is the flapping really happening? Is this an issue with FP?)

Where exactly, what, what is not

(2nd pass): Tracked down the flapping MAC address: it is flapping between eVPC links on the remote N5K pair going to a FEX

Narrowed down to port-channel rather than FabricPath


Step 2 - Troubleshoot

Break down the problem - simplify

• Network Topology

• Technology

Apply Knowledge & Experience

• How things should have worked

• What are the changes

Identify the possible causes

• Changes (known vs. unknown)

• Rule out

Test the most probable cause

• Explain the symptoms (Is and Is Not)

• Satisfy the conditions (What, When, Where, Extent)

• Use the most approachable tests

What do we need: some logic, some tools


Scenario 1 – packet drop

Hi,

We have 2 filer heads connected to Nexus on the same vlan. These filers have vpc to these 2 Nexus. We see filer head 1 with IP ip_A send 21 ping requests and receive only 17. We captured traffic at filer which sent 21 ping request and we can see we didn't get response from 4. Similarly we took capture at the other filer with IP address ip_B as per that capture we don't see 4 packets reached the filer head 2


Scenario 1 - topology

Both filer 1 and filer 2 are dual linked to a pair of N5Ks in vPC

[Figure: Filer 1 and Filer 2, each dual linked to N5k01 and N5k02; 21 requests sent, 17 responses received]


Scenario 1 – Narrowing down

4 different forwarding paths

- Narrow down on 1 path

- Narrow down on 1 direction

- Narrow down on 1 link or device

Tools: PACL, sniffer

[Figure: Filer 1 and Filer 2, each dual linked to N5k01 and N5k02]
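One way to read the "4 different forwarding paths" is two directions (request and reply) times two vPC switches. The Python sketch below enumerates them so each can be checked in turn. It assumes that a frame arriving on one vPC switch leaves through that switch's own member link rather than crossing the peer link; the names are taken from the topology, and the function name is hypothetical.

```python
from itertools import product

SWITCHES = ["N5k01", "N5k02"]
# (source, destination) for the ping request and for the reply.
FLOWS = [("Filer 1", "Filer 2"), ("Filer 2", "Filer 1")]


def forwarding_paths():
    """List every (source, switch, destination) combination to check."""
    return [(src, sw, dst) for (src, dst), sw in product(FLOWS, SWITCHES)]


for path in forwarding_paths():
    print(" -> ".join(path))
```

With the list in hand, a PACL or sniffer on each ingress port shows which of the four paths loses packets.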


Scenario 2 – OTV config assistance: “Narrowed down to MTU issue”

Topology: DC1 – ((( SP1 ))) – ((( SP2 ))) -- … -- DC2

\-- ((( SPx))) – DC3 – ((( SP y ))) --/

- What is the traffic path for the failed ping? DC1 – DC2

- Is the issue with the service provider? Ping from the DC1 edge to the DC2 edge with a larger packet size and the df-bit set. The ping is successful, so the issue is within your network

- Ping from the OTV join interface to the local DC edge and narrow down where the MTU is not set correctly

- If the MTU is set correctly but pings with larger packets still do not go through: is it the right interface? If it is, call TAC; the problem could be with hardware programming


Scenario 3 – N5K forwarding issue

We are intermittently experiencing 6% packet loss to a server. Interface counters are clean but the loss is still occurring.


Scenario 3

Back to back double vPCs

Sniffer capture on N5K1 and host1

[Figure: topology showing switch pairs N7k01/N7k02, N5k01/N5k02, and N7k03/N7k04, with host1 and host2]


Scenario 3

Tools: PACL, sniffer

CLIs can be your best friend:

“show port-channel load-balance forwarding-path”

[Figure: topology showing switch pairs N7k01/N7k02, N5k01/N5k02, and N7k03/N7k04]


Verifying the Root Cause

Test against the conditions:

• Does the probable cause match the problem description?

• Does the probable cause satisfy all of the conditions?

Test against the cause:

• Eliminate the probable cause: does the problem get eliminated?

• Reproduce the same condition: does the problem get reproduced?
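The "test against the conditions" check can be written down as a small filter. This is an illustrative Python sketch only: a candidate cause survives if it explains every observed symptom and predicts none of the symptoms known to be absent. The cause and symptom labels are hypothetical examples, not data from a real case.

```python
def consistent_causes(candidates, observed, not_observed):
    """Keep causes that explain all observed symptoms (the "Is" list) and
    predict none of the symptoms known to be absent (the "Is Not" list)."""
    return [
        name
        for name, predicted in candidates.items()
        if observed <= predicted and not (predicted & not_observed)
    ]


# Hypothetical labels: each candidate cause maps to the symptoms it predicts.
CANDIDATES = {
    "drops on a shared physical link": {"slow on VLAN 50", "slow on other VLANs"},
    "VLAN 50 traffic software switched": {"slow on VLAN 50"},
}
OBSERVED = {"slow on VLAN 50"}
NOT_OBSERVED = {"slow on other VLANs"}

print(consistent_causes(CANDIDATES, OBSERVED, NOT_OBSERVED))
```

The shared-link candidate is dropped because it predicts slowness on other VLANs, which is on the "Is Not" list.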


The TAC Secret Ingredients

1) Understanding the problem

From: Network slowness

To: host A and host B are seeing slowness during file transfer

2) Troubleshoot

Break down the issue

• Network topology

• L2 vs. L3

• Affected hosts/vlan

Apply knowledge

• Forwarding path of the traffic

• L2 vs. L3

• ARP

• Difference between sites

Identify possible causes

• Software processing

• L2 instability

• Unicast flooding

• Faulty hardware

Test probable cause

• Ping / Traceroute

• Working vs. non-working

• Tools: PACL, SPAN, etc.

3) Confirm the root cause


Troubleshooting Basics – Toolkit


Sniffer Capture

• SPAN (Switched Port ANalyzer): A tool that captures traffic from a source and directs it to a destination interface

• Ethanalyzer

• TCPdump

• Wireshark

Pros: Commonly available on hosts, switches, appliances ( Firewalls, load balancers ), etc

Useful for intermittent packet drop, network performance type of issues

Cons:

Could be time/resource consuming if the sniffer is not readily available


PACL

• Port Access List

IP access list TEST_ICMP

statistics per-entry

10 permit ip 1.1.1.1/32 100.100.100.100/32 [matches=10]

20 permit ip any any [matches=17642]

Pros: Commonly available on switches

Easy to use, quick to get results

Useful for intermittent packet drop, network performance type of issues

Cons:

Requires configuration changes, which may not be possible on the fly for some deployments due to change control

Available in the ingress direction only
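The per-entry match counters are easy to post-process. The following is a minimal Python sketch, not an NX-OS feature: it parses text shaped like the TEST_ICMP sample above and computes how much each counter grew between two snapshots. The function names and the second snapshot value are hypothetical.

```python
import re
from typing import Dict

# Matches lines such as: 10 permit ip 1.1.1.1/32 100.100.100.100/32 [matches=10]
ENTRY = re.compile(r"^\s*(\d+)\s+(permit|deny)\s+(.*?)\s+\[match(?:es)?=(\d+)\]\s*$")


def parse_matches(output: str) -> Dict[int, int]:
    """Map ACL sequence number -> match count."""
    counts = {}
    for line in output.splitlines():
        m = ENTRY.match(line)
        if m:
            counts[int(m.group(1))] = int(m.group(4))
    return counts


def delta(before: Dict[int, int], after: Dict[int, int]) -> Dict[int, int]:
    """Counter growth per sequence number between two snapshots."""
    return {seq: after[seq] - before.get(seq, 0) for seq in after}


SAMPLE = """\
IP access list TEST_ICMP
statistics per-entry
10 permit ip 1.1.1.1/32 100.100.100.100/32 [matches=10]
20 permit ip any any [matches=17642]
"""

before = parse_matches(SAMPLE)
after = dict(before)
after[10] = 31  # hypothetical second snapshot taken after sending 21 pings
print(delta(before, after)[10])
```

If 21 pings were sent and entry 10 grew by 21 on the ingress port, the packets reached this switch and the loss is further along the path.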


Ethanalyzer

• Built-in sniffer for CPU bound traffic

• ‘capture-filter’ vs. ‘display-filter’

• ‘decode-internal’

• Other options

• Ethanalyzer does not

• Capture data plane traffic forwarded in hardware

• Support interface specific capture

• Ethanalyzer guides

• http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116136-trouble-ethanalyzer-nexus7000-00.html

• http://www.cisco.com/c/en/us/support/docs/switches/nexus-5000-series-switches/116201-technote-ethanalyzer-00.html


Logging Capabilities

• Persistent logging (Nexus 7000)

• Constant logging – event history

• Accounting log

• Commands:

• ‘show file logflash://sup-active//log/messages’

• ‘logging level <feature> <level>’

• ‘show log logfile’, ‘show log nvram’

• ‘show accounting log’

• ‘show system internal <feature> event-history’

• ‘show <feature> internal event-history’


Granular Show Commands and CLI Filtering

• Improved IOS-like CLI

• Feature specific show commands

• ‘show run’, ‘show run <feature>’ and ‘show run all’

• ‘show’ commands can be executed from exec or config mode

• Output piping ‘show xxx | ?’

• Well structured ‘show’ commands

• ‘show system internal’

• ‘show hardware internal’

• ‘show <feature> internal’

• Useful commands

• ‘hex’ / ‘dec’

• ‘diff’

• ‘show cli history [unformatted]’


Granular Show Tech-support

• Capture show tech

• ‘show tech detail’

• ‘tac-pac’

• ‘show tech <feature>’

• ‘show tech all binary’ (6.2.x feature)

• Need-to-know

• Collect show tech details as soon as possible

• Redirect the outputs to files using ‘>’

• Append to files with ‘>>’

• Capture feature show tech in addition to show tech detail


EEM (Embedded Event Manager)

• A subsystem to automate tasks and customize the device behavior

• Event

• Notification

• Action

• Many built-in system policies: ‘show event manager system-policy’

• Event notification action

• Helpful in data gathering when the occurrence of the issue is unpredictable


ELAM (Embedded Logic Analyzer Module)

• A tool to capture a packet and determine its forwarding path within the switch

• Powerful and flexible triggering capability

• Module specific

• Available on Nexus 7000 and Nexus 6000

• Need-to-knows

• L2-4 data plane forwarding issues

• Consistent problem

• Not a replacement for capture utilities like Ethanalyzer or SPAN

• ELAM guides

• http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116648-technote-product-00.html

• http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116647-technote-product-00.html


NX-OS Tools Summary

• Granular show commands and CLI filtering

• Logging Capabilities

• Granular show tech-support

• GOLD (Generic OnLine Diagnostics)

• OBFL (On-Board Failure Logging)

• Ethanalyzer (built-in “CPU sniffer”)

• ELAM

• EEM (Embedded Event Manager)

• SPAN

• Debugs (with filters & redirection) and Debug Plugins

• Programmability

Info Collection

Hardware

Troubleshoot


OBFL (On-Board Failure Logging)

• Persistent logging

• 32MB onboard flash

• Logs various events, for example

• Reset reason

• Statistics history

• Kernel trace

• others

• Command

• ‘show logging onboard mod <x>’


GOLD (Generic OnLine Diagnostics)

• A diagnostic framework that runs while the system is operational

• Corrective actions are taken through Embedded Event Manager (EEM) policies

• Tests run on both Supervisors and line cards

• Test types

• Bootup

• Health Monitoring

• On-demand

• Scheduled

• Commands

• ‘show diagnostics content’

• ‘show diagnostics result’

• ‘show diagnostics ?’


Debugs

• When event-history is not sufficient

• Use debug logfile ‘debug logfile <file>’

• Use debug-filter

• Debug-filter

• More granular debugs

• Can apply multiple filters simultaneously

• Commands

‘debug-filter pktmgr interface e1/1’

‘debug-filter pktmgr dest-mac 0100.5e00.000D’

‘show debug-filter all’


Programmability

• Adds control blocks in the CLI execution

• NX-API

• Python

– Cli(), Clid(), Clip()

– Interactive mode

– Noninteractive mode

• TCL

– Tcl8.5, NXOS 5.1(1)

– ‘tclsh bootflash:example.tcl’

• Search for “python API” on cisco.com


Case Studies


Case Study – Issue 1


Problem As Described

We recently migrated from Catalyst 6500 to a pair of Nexus switches in vPC.

The setup has all of our access switches connected to the pair. As soon as we cut over to the Nexus, we see performance issue across the data center. Phones stop working and a few servers are not reachable. Everything was working fine before the cut over


[Figure: topology with the Nexus pair, the Cat 6500, the FTP server, and users]


Multiple problems are described here. It is best to handle one problem at a time

1a) Slowness across Data Center

1b) Phones are not working

1c) Some servers are not reachable


Issue 1a – Slowness Across Data Center

[Figure: topology with the Nexus pair, the Cat 6500, the FTP server, and users]


Asking the right questions

Slowness across Data Center

Which application(s) specifically do users have problems accessing? (Say, FTP)

Q) Is the user able to start the FTP process? YES

Q) Define slow. A file download takes 30 minutes instead of the earlier 3-5 minutes

Q) Are all users affected? Not sure; all users are on VLAN 50.

Q) What do you observe when a user is moved to a different VLAN? Works fine

Q) Are all users in VLAN 50 connected to the same access layer switch? NO. It does not matter where the user is connected; if the machine is on VLAN 50, FTP is slow. Look for a common point in the network that the traffic traverses: the Nexus pair

We have narrowed down the issue to the Nexus pair. Let’s dig deeper.


What to look for

Interface Drops – Given that the problem is specific to VLAN 50, drops on the port are very unlikely, because traffic from other VLANs traverses the same physical links

Spanning Tree – Use CLI commands, e.g. “show spanning-tree detail”, to look for excessive Topology Change Notifications (TCNs), as too many TCNs result in flooding

Forwarding path – Check whether the traffic is being software switched for any reason; we can use a tool called Ethanalyzer


Ethanalyzer – See if Packets are CPU switched

Ethanalyzer is a tool to monitor traffic to and from the CPU

N7k# ethanalyzer local interface inband capture-filter tcp

2014-01-28 17:19:55.730066 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#20] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0

2014-01-28 17:19:55.730193 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#21] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0

2014-01-28 17:19:55.730340 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#22] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0

Aha!

Traffic getting software switched

We have narrowed down the slowness to be a result of packets getting software switched.
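When a capture is long, it helps to summarize which flows the CPU is seeing. The Python sketch below counts packets per (source, destination, protocol) in text shaped like the capture above. It is an offline illustration working on saved output; the function name is hypothetical.

```python
import re
from collections import Counter

# date, time, source, "->", destination, protocol
LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \S+\s+(\S+) -> (\S+)\s+(\S+)")


def flows_seen_by_cpu(capture: str) -> Counter:
    """Count captured packets per (source, destination, protocol)."""
    flows = Counter()
    for line in capture.splitlines():
        m = LINE.match(line.strip())
        if m:
            flows[m.groups()] += 1
    return flows


CAPTURE = """\
2014-01-28 17:19:55.730066 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#20] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
2014-01-28 17:19:55.730193 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#21] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
2014-01-28 17:19:55.730340 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#22] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
"""

for flow, count in flows_seen_by_cpu(CAPTURE).most_common():
    print(flow, count)
```

Transit host-to-host flows that show up repeatedly in the inband capture are the ones to investigate, since that traffic would normally be forwarded in hardware.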


What can cause traffic to go to CPU

• Packets with IP options

• Packets longer than the MTU, which need to be fragmented

• IP redirects – when the next hop is reachable on the same VLAN as the one the packet came in on. Check the IP route for the source and destination IP (‘no ip redirects’ under interface vlan)

• Hardware Misprogramming – Call TAC
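The IP redirect condition can be approximated with a subnet check. This Python sketch is a deliberate simplification: it only tests whether the packet's source and the route's next hop both sit in the subnet of the ingress SVI, which is the situation described in the bullet above. All addresses are hypothetical examples, and real platforms apply further conditions.

```python
import ipaddress


def may_trigger_redirect(ingress_svi: str, source: str, next_hop: str) -> bool:
    """True when source and next hop are both inside the ingress SVI subnet.

    ingress_svi is the SVI address with prefix length, e.g. "10.8.51.1/24".
    """
    net = ipaddress.ip_interface(ingress_svi).network
    return (
        ipaddress.ip_address(source) in net
        and ipaddress.ip_address(next_hop) in net
    )


# Hypothetical addressing for illustration.
print(may_trigger_redirect("10.8.51.1/24", "10.8.51.4", "10.8.51.254"))
print(may_trigger_redirect("10.8.51.1/24", "10.8.51.4", "10.9.0.1"))
```

The first call models a packet that would be routed back out the VLAN it arrived on; the second models a next hop on a different subnet.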


Issue 1b – Phones Not Working

[Figure: topology with the Nexus pair, the Cat 6500, the FTP server, the DHCP server, and users]


Asking the right questions

Q) Are all phones down – No, only phones in VLAN 10

Q) Are they rebooting? What phase of bootup is failing? They are unable to get an IP address

Q) Packet capture on DHCP server – No discovers from these specific phones are seen

Q) Are all phones in VLAN 10 connected to one switch – No, it is across the network

Start looking from the CORE – Nexus pair


Asking the right questions

Q) What is different about VLAN 10? Nothing different

Q) What VLAN is the DHCP server in? There are 3 servers, one in VLAN 10 and the others in VLAN 20 and 30, but the one in VLAN 10 is the only active DHCP server at this point.

DHCP server in same VLAN as the phone – Good data point

Q) No discovers from these specific phones are seen on the server. Do they reach the Nexus switches? Let’s use SPAN/Ethanalyzer to confirm


Ethanalyzer – See if DHCP relay works

sw1(config-if)# ethanalyzer local interface inband capture-filter "port 67" limit-captured-frames 0

Capturing on inband

2014-05-01 06:20:41.793378 0.0.0.0 -> 255.255.255.255 DHCP DHCP Discover - Transaction ID 0x3e96b16d

2014-05-01 06:20:41.793763 10.7.1.2 -> 10.5.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d

2014-05-01 06:20:41.793763 10.7.1.2 -> 10.6.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d

The Nexus switch is sending the relayed packets to the servers in VLAN 20 and 30, which are not active.

We were assuming that the DISCOVER, being a broadcast packet, would also reach the server on the same VLAN

NOT TRUE

On Nexus, when relay is used, the DHCP server must be specified as a relay destination even when it is on the same VLAN
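The check made by eye on the capture above, which servers the Discover is relayed to, can be sketched in Python. The parser works on saved capture text; `ACTIVE_SERVER` is a hypothetical address for the VLAN 10 server, since the slides do not give it.

```python
import re

# source, "->", destination, followed by the DHCP Discover summary
LINE = re.compile(r"(\S+) -> (\S+)\s+DHCP\s+DHCP Discover")


def relay_targets(capture: str) -> set:
    """Destinations of relayed (unicast) DHCP Discovers in the capture."""
    targets = set()
    for line in capture.splitlines():
        m = LINE.search(line)
        if m and m.group(2) != "255.255.255.255":
            targets.add(m.group(2))
    return targets


def active_server_missing(capture: str, active_server: str) -> bool:
    """True when no relayed Discover is addressed to the active server."""
    return active_server not in relay_targets(capture)


CAPTURE = """\
2014-05-01 06:20:41.793378 0.0.0.0 -> 255.255.255.255 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.5.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.6.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
"""
ACTIVE_SERVER = "10.7.1.220"  # hypothetical address of the VLAN 10 server

print(sorted(relay_targets(CAPTURE)))
print(active_server_missing(CAPTURE, ACTIVE_SERVER))
```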


Issue 1c – Some Servers are not reachable

[Figure: topology with the Nexus pair, the Cat 6500, and servers]


Asking the right questions

Q) Are you able to ping the server from Nexus – YES

Q) Are you able to ping the server from Nexus sourcing a different VLAN IP – No

Something must be wrong with the default gateway setting. But why did it work with the Cat6500 then?

Let’s see what the packets look like using Ethanalyzer when we ping from the Nexus


Ethanalyzer : See what traffic is received

N7k# ethanalyzer local interface inband capture-filter "arp or host 172.16.3.41"

2014-03-20 15:15:42.996037 00:0f:bb:18:c5:70 -> ff:ff:ff:ff:ff:ff ARP Who has 192.168.1.1? Tell 172.16.3.41

This does not look right: the end server is ARPing for the destination instead of its gateway.

The Nexus needs to proxy-reply to the ARP.

It worked with Cat6500 because proxy ARP is on by default. On Nexus platforms proxy ARP is disabled by default.
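The tell-tale sign in the capture is an ARP request for an address outside the sender's own subnet. A minimal Python sketch of that check follows; the /24 prefix length is an assumption, because the capture does not show the server's mask, and the function name is hypothetical.

```python
import ipaddress
import re

ARP = re.compile(r"Who has (\S+)\?\s+Tell (\S+)")


def arp_is_off_subnet(line: str, sender_prefix_len: int) -> bool:
    """True when the ARP target lies outside the sender's subnet.

    A host with a working default gateway would ARP for the gateway
    instead, so True suggests a missing or wrong gateway on the host.
    """
    m = ARP.search(line)
    if not m:
        raise ValueError("not an ARP request line")
    target, sender = m.groups()
    sender_net = ipaddress.ip_interface(f"{sender}/{sender_prefix_len}").network
    return ipaddress.ip_address(target) not in sender_net


LINE = "ARP Who has 192.168.1.1? Tell 172.16.3.41"
print(arp_is_off_subnet(LINE, 24))  # /24 is an assumed mask
```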


Conclusion

How did we solve these issues

1) Taking them one issue at a time

2) Asking the right questions to narrow down the scope of the problem

3) Using logical reasoning and the right tools to root cause the problem


Case Study – Issue 2


Problem As Described

At 1AM every day there is a network outage. Servers lose access to the network.


Asking the right questions

Are there any scheduled activities that take place at 1AM every night? None we are aware of

Is the whole network affected? Yes, servers across multiple segments affected

What is done to restore connectivity? Nothing, it recovers by itself

How long does this last? For 2-3 minutes


Where do we start

Since the entire network is affected, start from the CORE

Check logs for any activities at 1AM for the last few days

Check for any STP event, routing changes at that time


Logs

show logging log:

2013 Apr 7 01:00:03 Nexus %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 10, VPC peer keep-alive receive has failed

2013 Apr 7 01:00:22 Nexus %STP-2-DISPUTE_DETECTED: Dispute detected on port Ethernet10/13 on VLAN1616

2013 Apr 7 01:00:30 Nexus %STP-2-DISPUTE_CLEARED: Dispute resolved for port Ethernet10/13 on VLAN1616

2013 Apr 7 01:02:36 Nexus %BGP-3-NOTIFICATION: sent to neighbor 10.137.29.9 4/0 (hold time expired) 0 bytes

show accounting log:

Sun Apr 7 01:00:01 2013:type=start:id=171.69.89.32@pts/0:user=prime:cmd=

Sun Apr 7 01:00:03 2013:type=update:id=171.69.89.32@pts/0:user=prime:cmd=show access-list
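Lining up the two logs by timestamp is what exposes the link between the command and the failures. The Python sketch below does that correlation on saved log text. The regular expressions are written for the two line formats shown above, and the 5-second window is an arbitrary choice for illustration.

```python
import re
from datetime import datetime, timedelta

SYSLOG = re.compile(r"^(\d{4} \w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ (%\S+):")
ACCT = re.compile(
    r"^(\w{3} \w{3}\s+\d+ \d\d:\d\d:\d\d \d{4}):.*?user=([^:]*):cmd=(.*)$"
)


def parse_syslog(text):
    """Return (timestamp, mnemonic) pairs from 'show logging' style lines."""
    events = []
    for line in text.splitlines():
        m = SYSLOG.match(line.strip())
        if m:
            when = datetime.strptime(m.group(1), "%Y %b %d %H:%M:%S")
            events.append((when, m.group(2)))
    return events


def parse_accounting(text):
    """Return (timestamp, user, command) from 'show accounting log' lines."""
    entries = []
    for line in text.splitlines():
        m = ACCT.match(line.strip())
        if m and m.group(3):  # skip session-start records with no command
            when = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            entries.append((when, m.group(2), m.group(3)))
    return entries


def commands_before(events, entries, window_s=5):
    """Pair each event with commands run at most window_s seconds earlier."""
    window = timedelta(seconds=window_s)
    return [
        (mnemonic, user, cmd)
        for when, mnemonic in events
        for at, user, cmd in entries
        if timedelta(0) <= when - at <= window
    ]


SYSLOG_TEXT = """\
2013 Apr 7 01:00:03 Nexus %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 10, VPC peer keep-alive receive has failed
2013 Apr 7 01:00:22 Nexus %STP-2-DISPUTE_DETECTED: Dispute detected on port Ethernet10/13 on VLAN1616
"""
ACCT_TEXT = """\
Sun Apr 7 01:00:01 2013:type=start:id=171.69.89.32@pts/0:user=prime:cmd=
Sun Apr 7 01:00:03 2013:type=update:id=171.69.89.32@pts/0:user=prime:cmd=show access-list
"""

for hit in commands_before(parse_syslog(SYSLOG_TEXT), parse_accounting(ACCT_TEXT)):
    print(hit)
```

A match only shows that the command and the event were close in time. It points to where to look next; it does not prove cause.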


CPU History

Multiple control plane protocols were affected, so the CPU most likely spiked

show processes cpu history


How do we find out what process spiked?

EEM script

event manager applet HIGH-CPU

event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 get-type exact entry-op ge entry-val 90 exit-val 50 poll-interval 5

action 1.0 syslog msg High CPU hit $_event_pub_time

action 2.0 cli enable

action 3.0 cli show clock >> bootflash:high-cpu.txt

action 4.0 cli show processes cpu sort >> bootflash:high-cpu.txt


Output of EEM Script

PID Runtime(ms) Invoked uSecs 1Sec Process

----- ----------- -------- ----- ------ -----------

4857 5986662 12668354 472 45.7% aclmgr

7069 770167668 689087915 1117 11.5% statsclient

4720 516729732 166565721 3102 6.7% oc_usd

5534 216 41 5269 6.7% pim

4915 262 702 374 5.8% netstack

5485 899469787 2147483647 236 3.8% stp

4667 108772307 105958793 1026 2.9% R2D2_usd

ACLMGR is the process invoked when the ACL configuration is polled or changed
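When the EEM applet has appended several snapshots to the file, picking out the busiest process by eye gets tedious. The following is a minimal sketch, assuming the column layout shown on this slide (PID, Runtime, Invoked, uSecs, 1Sec, Process); the function name `top_processes` is hypothetical.

```python
# Snapshot copied from the slide; the layout is assumed to be stable.
OUTPUT = """\
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4857 5986662 12668354 472 45.7% aclmgr
7069 770167668 689087915 1117 11.5% statsclient
4720 516729732 166565721 3102 6.7% oc_usd
5534 216 41 5269 6.7% pim
4915 262 702 374 5.8% netstack
5485 899469787 2147483647 236 3.8% stp
4667 108772307 105958793 1026 2.9% R2D2_usd
"""


def top_processes(text, n=3):
    """Return the n busiest (process, one_second_cpu_percent) pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six fields, a numeric PID and a percentage in column 5.
        if len(fields) == 6 and fields[0].isdigit() and fields[4].endswith("%"):
            rows.append((fields[5], float(fields[4].rstrip("%"))))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


busiest = top_processes(OUTPUT, n=1)
print(busiest)
```

The header and separator rows are skipped because their first field is not numeric, so the same function can be run over a file containing many appended snapshots.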


What was found

A management tool was polling the ACL configuration (show access-list) from the device, which has over 32k ACEs

This caused ACLMGR to spike, affecting other protocols

The behavior was root-caused to a bug that was fixed in a newer code release


Case Study – Issue 3


Problem As Described

The network was severely degraded for several hours and eventually recovered by itself. We need to understand what caused the issue – RCA (root cause analysis)


Asking the right questions

Q) What specific applications were affected? All applications

Q) Were users unable to connect at all or was it slow? Both

Q) What changes were made, if any? No known changes

Q) What was done to fix this? Nothing known

Q) What time did the issue start? How long did it last? Issue started at around 1PM and stabilized at around 11:30PM


What to check? Where to start?

Network wide events – Check Spanning Tree events, routing loops, broadcast storm

Starting from the core, check all switches for interface drops, error logs

Inspect CoPP for any drops on the Core Switches

Check monitoring tools for alerts on high link utilization, CPU spikes, failures, etc.


Checking Spanning-tree

Nexus# show spanning-tree detail

VLAN0100 is executing the rstp compatible Spanning Tree protocol

Bridge Identifier has priority 0, sysid 1, address c84c.75fa.6000

Configured hello time 2, max age 20, forward delay 15

We are the root of the spanning tree

Topology change flag not set, detected flag not set

Number of topology changes 6 last change occurred 160:10:57 ago >>> Spanning Tree looks stable

from port-channel20

Times: hold 1, topology change 35, notification 2

hello 2, max age 20, forward delay 15

Timers: hello 0, topology change 0, notification 0


Checking interface drops

Nexus# show interface counter error

--------------------------------------------------------------------------------

Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards

--------------------------------------------------------------------------------

Eth3/1 0 0 0 0 0 0

Eth3/2 0 0 0 0 0 105430

Eth3/3 0 0 0 0 0 105421

Eth3/4 0 0 0 0 0 105794

Similar output drop counters on Eth3/2, Eth3/3 and Eth3/4


Checking Control Plane Policing Drops

Nexus# show policy-map interface control-plane

<snip>

class-map copp-system-p-class-normal (match-any)

[snip]

match protocol arp

set cos 1

police cir 680 kbps bc 250 ms

conform action: transmit

violate action: drop

module 3:

conformed 4582560313 bytes,

5-min offered rate 3452 bytes/sec

violated 37822500313 bytes,

5-min violate rate 0 bytes/sec

Huge violations in the ARP class – either a loop in the network or a misbehaving device flooding the network with ARP traffic
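To put a number on "huge", the byte counters can be compared directly. The sketch below is an illustration under the assumption that the output keeps the `conformed N bytes` / `violated N bytes` wording shown on this slide; the function names are hypothetical. Note that these counters are cumulative, and the 5-min violate rate of 0 in the capture suggests the drops were no longer occurring when the output was taken.

```python
import re

# Excerpt copied from the slide output above.
COPP = """\
police cir 680 kbps bc 250 ms
conform action: transmit
violate action: drop
module 3:
conformed 4582560313 bytes,
5-min offered rate 3452 bytes/sec
violated 37822500313 bytes,
5-min violate rate 0 bytes/sec
"""


def copp_counters(text):
    """Extract the cumulative (conformed, violated) byte counters."""
    conformed = int(re.search(r"conformed (\d+) bytes", text).group(1))
    violated = int(re.search(r"violated (\d+) bytes", text).group(1))
    return conformed, violated


def violated_share(conformed, violated):
    """Fraction of offered bytes in the class that the policer dropped."""
    offered = conformed + violated
    return violated / offered if offered else 0.0


conformed, violated = copp_counters(COPP)
share = violated_share(conformed, violated)
print(f"{share:.1%} of the bytes offered to this class were dropped")
```

For the counters on the slide this works out to roughly 89% of the bytes offered to the class being dropped, which is what makes the ARP class stand out.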


Check SNMP server for link utilization graph


What have we learnt so far?

Clearly a broadcast storm in the network

The storm consists of ARP packets

The ARP storm caused congestion and may have caused valid ARP packets to be dropped

WHAT CAUSED IT / WHERE DID IT ORIGINATE ???


How do we identify what caused the storm?

Track the origin using simple “show interface counter”

Start from the CORE and inspect which interface has very high InBroadcast

--------------------------------------------------------------------------------

Port InMcastPkts InBcastPkts

--------------------------------------------------------------------------------

<snip>

Eth3/10 336 15315895309


How do we identify what caused the storm?

Track the origin using simple “show interface counter”

Track it to access layer switch to determine where this could have come from

--------------------------------------------------------------------------------

Port InMcastPkts InBcastPkts

--------------------------------------------------------------------------------

<snip>

Eth105/1/11 336 1531589530

Eth105/1/12 336 1527225940

<snip>

Eth107/1/3 337 1544275579
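Walking the counters hop by hop is easier if the interfaces are ranked by inbound broadcast count. The following is a rough sketch, assuming the three-column layout shown on this slide (Port, InMcastPkts, InBcastPkts); the function name and the threshold value are hypothetical and would need tuning for a real network.

```python
# Rows copied from the access-layer output on the slide.
COUNTERS = """\
Port InMcastPkts InBcastPkts
Eth105/1/11 336 1531589530
Eth105/1/12 336 1527225940
Eth107/1/3 337 1544275579
"""


def broadcast_heavy_ports(text, min_bcast=1_000_000_000):
    """Return (port, in_bcast_pkts) pairs at or above min_bcast, highest first.

    min_bcast is an assumed threshold used only for illustration.
    """
    ports = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows: port name followed by two numeric counters.
        if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            bcast = int(fields[2])
            if bcast >= min_bcast:
                ports.append((fields[0], bcast))
    return sorted(ports, key=lambda item: item[1], reverse=True)


suspect_ports = broadcast_heavy_ports(COUNTERS)
for port, bcast in suspect_ports:
    print(port, bcast)
```

Because these counters are cumulative since the last clear, comparing two readings taken a known interval apart gives a better picture of an active storm than a single snapshot.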


Conclusion

All of this traffic was a storm originating from a few servers

A new application had been deployed on those servers

The server team reverted the changes at around 11PM (and did not tell anyone)

What could have prevented this?

Design the network using best practices such as storm control

Maintain a log of all changes made so that it can be referenced when needed


Issue 4: Intermittent connectivity loss to servers


MPLS PE

Servers

Users

MPLS PE MPLS P

Default VRF

Server VRF


Asking the right questions

Q) How often do users lose connectivity to servers? Randomly, no pattern

Q) Is connectivity lost to all servers? No, a few servers are reachable and a few are not

Q) Do all users lose connectivity to a specific server? Yes

Q) What do you do to fix it? Nothing, it resolves by itself

Q) Is the server reachable from its default gateway? When we ping the server from its gateway, the first few pings fail, after which the server is reachable from its gateway and almost immediately users are able to connect


Inspect if ARP Glean works

Ethanalyzer – Check if packets are punted to CPU for it to generate an ARP request

N7k# ethanalyzer local interface inband capture-filter "arp or host 172.25.3.41"

Nothing seen in Ethanalyzer – ARP never completes because the ARP request is not generated. This is why users lose access to servers


Hardware Rate Limiter

Hardware rate limiters are in place to protect CPU (like CoPP)

N7k# show hardware rate-limiter layer-3 glean

Units for Config: packets per second

Allowed, Dropped & Total: aggregated since last clear counters

Rate Limiter Class Parameters

------------------------------------------------------------

layer-3 glean Config : 100

Allowed : 10146910

Dropped : 4636432 >> Increasing at a rapid rate

Total : 14783342
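The counters above can be turned into a drop share, and two readings taken a known interval apart give the drop rate that the ">> Increasing at a rapid rate" note refers to. This is a minimal sketch using the values from the slide; the function names are hypothetical, and any second reading would have to come from re-running the command on the device.

```python
def glean_drop_share(allowed, dropped):
    """Fraction of glean packets dropped since the counters were last cleared."""
    total = allowed + dropped
    return dropped / total if total else 0.0


def drops_per_second(first_dropped, second_dropped, interval_s):
    """Average drop rate between two readings taken interval_s seconds apart."""
    return (second_dropped - first_dropped) / interval_s


# Values copied from the slide output above.
ALLOWED, DROPPED, TOTAL = 10146910, 4636432, 14783342

# Sanity check: the slide's Total equals Allowed plus Dropped.
assert ALLOWED + DROPPED == TOTAL

share = glean_drop_share(ALLOWED, DROPPED)
print(f"{share:.1%} of glean packets dropped since the last counter clear")
```

With the slide's numbers, a little under a third of all glean packets were dropped by the rate limiter, which is consistent with ARP requests for some servers never being generated.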


Conclusion

By asking the right questions we were able to rule out the complex MPLS network as the cause.

Knowing how glean comes into play, we narrowed the issue down to excess glean traffic causing the hardware rate limiter to kick in.

Glean Throttling was implemented to fix this issue.


The Dos and The Donts


The Dos

• Understand how things should work (on your network)

• Identify the broken scenario – define “broken” and/or “not working”

• Determine possible triggers, patterns, time frame

• Use solid troubleshooting techniques, start with basics

• Capture valuable information

• Ask the right questions


The Dos - Continued

• Stay calm

• Bring all relevant parties to the table

• Backup

• Documentation (network topology, traffic flow, IP addressing, etc.)

• Network Management


The DONTs

• Jump to conclusions – most of the time it’s not a bug

• Take drastic measures prematurely

• 'let's bounce the datacenter'

• 'we are reloading the switches one at a time'

• Lump all issues together

• Make multiple changes at once

• Combine the status update and the technical call in one


The TAC secret ingredients

Troubleshoot

Apply Knowledge

Identify possible causes

Test the most probable cause

Break down the issue

Understanding the problem

Confirm the root cause


Complete Your Online Session Evaluation

Don’t forget: Cisco Live sessions will be available for viewing on-demand after the event at CiscoLive.com/Online

• Give us your feedback to be entered into a Daily Survey Drawing. A daily winner will receive a $750 Amazon gift card.

• Complete your session surveys through the Cisco Live mobile app or your computer on Cisco Live Connect.


Continue Your Education

• Demos in the Cisco campus

• Walk-in Self-Paced Labs

• Table Topics

• Meet the Engineer 1:1 meetings

• Related sessions


Q&A


Thank you