![Page 1: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/1.jpg)
© 2017 VMware Inc. All rights reserved.
Toronto VMUG Q1 Meeting
Steve Sykes, Staff Engineer
Global Support, Premier Services
January 31, 2017
Troubleshooting Storage Performance
![Page 2: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/2.jpg)
Agenda
• The ESXi Storage Stack
• Troubleshooting Performance
• Recommended Practices, Tools and Tips
• Steve’s 4-dimensional framework for latency
• Sample case # 1: Latency
• Sample case # 2: Unresponsive guests
• Community and other useful resources
![Page 3: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/3.jpg)
VMware ESXi Architecture
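[Diagram: a Virtual Machine (Guest OS with TCP/IP, File System and I/O Drivers) runs on the Monitor (BT, HW, PV), which presents a Virtual NIC and Virtual SCSI. Beneath it, ESXi provides the Memory Allocator, Scheduler, Virtual Switch, File System, NIC Drivers and I/O Drivers on top of the Physical Hardware.]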
![Page 4: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/4.jpg)
Disk I/O Latencies
[Diagram: I/O path from Application / Guest OS through VMM, vSCSI, ESX Storage Stack, Driver, HBA and Fabric to the Array SP, annotated with GAVG, KAVG, QAVG and DAVG]
• Time spent in the ESX storage stack is minimal; for all practical purposes KAVG ≈ QAVG
• In a well-configured system, QAVG should be zero
• KAVG = GAVG – DAVG
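The relationship KAVG = GAVG – DAVG can be checked by hand against any sample. The snippet below is only a minimal sketch of that arithmetic; the function name and the sample values are made up for illustration and are not taken from the slides.

```python
def kernel_latency_ms(gavg_ms: float, davg_ms: float) -> float:
    """Derive kernel latency from guest and device latency (KAVG = GAVG - DAVG)."""
    return gavg_ms - davg_ms


# Hypothetical sample: the guest sees 12.5 ms, the device accounts for 12.0 ms.
kavg = kernel_latency_ms(gavg_ms=12.5, davg_ms=12.0)
print(f"KAVG = {kavg:.2f} ms")
```

A result close to zero is what the slide describes for a well-configured system; a clearly non-zero result points at queuing inside the host.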
![Page 5: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/5.jpg)
Disk I/O Queues
• GQLEN – Guest Queue
• AQLEN – Adapter Queue
• WQLEN – World Queue
• DQLEN – Device / LUN Queue
• SQLEN – Array SP Queue
DQLEN can change dynamically when SIOC is enabled
[Diagram: the queues above placed along the I/O path from Application / Guest OS through VMM, vSCSI, ESX Storage Stack, Driver, HBA and Fabric to the Array SP; callout: "Reported in esxtop"]
![Page 6: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/6.jpg)
Use the Right Tool
• esxtop
2 sec data points, VERY granular, not scalable across hosts
• vRealize Operations
5 min data points, very scalable, best starting view
• vCenter Performance Charts
20 sec data points, okay real-time data, poor history, recommend vROPs
• VSAN Observer
Most detailed tool to troubleshoot VSAN related performance
• 3rd Party
Ensure you know what the counters mean and their sample rate
![Page 7: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/7.jpg)
vRealize Operations
• vRealize Operations
– Manage storage performance at scale
– Integrate with your storage OEM
• VMware Virtual SAN with a management pack for storage monitoring:
– Virtual SAN object and component limits
– Disk / disk groups
– Virtual SAN datastore
![Page 8: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/8.jpg)
VSAN Observer
• VSAN Observer is the engineering performance tool.
– Latency
– IOPS
– Congestion
– OutstandingIO
– Bandwidth
• Do not use esxtop for VSAN performance troubleshooting
![Page 9: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/9.jpg)
Storage: Key Indicators
• Kernel Latency Average (KAVG)
This counter tracks the latency of I/O passing through the kernel
Investigation Threshold: 1ms
• Device Latency Average (DAVG)
This is the latency seen at the device driver level. It includes the round-trip time between the HBA and the storage.
Investigation Threshold: 10-15ms, lower is better, some spikes okay
• Guest Latency Average (GAVG)
This is the latency seen at the guest level. It is effectively DAVG + KAVG. Needed for network attached storage.
Investigation Threshold: 10-15ms, lower is better, some spikes okay
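As a rough illustration of how these investigation thresholds might be applied to a single sample, here is a minimal sketch. It assumes the lower end of the 10-15 ms band as the cut-off for DAVG and GAVG; the dictionary layout and function name are hypothetical.

```python
# Investigation thresholds in milliseconds. KAVG is from the slide; for DAVG
# and GAVG the lower end of the 10-15 ms band is an assumption made here.
THRESHOLDS_MS = {"KAVG": 1.0, "DAVG": 10.0, "GAVG": 10.0}


def counters_to_investigate(sample_ms):
    """Return the counters in one sample that exceed their threshold."""
    return [
        name
        for name, limit in THRESHOLDS_MS.items()
        if sample_ms.get(name, 0.0) > limit
    ]


# Hypothetical sample: low kernel latency, high device latency.
print(counters_to_investigate({"KAVG": 0.2, "DAVG": 18.0, "GAVG": 18.2}))
```

Because some spikes are acceptable, a check like this is better applied to a run of samples than to a single data point.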
![Page 10: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/10.jpg)
esxtop
– For live troubleshooting and root cause analysis; finer granularity (2-second samples)
– Lots of metrics reported
esxtop screens
[Diagram: screens grouped by subsystem: CPU Scheduler (c, i, p), Memory Scheduler (m), Virtual Switch (n), vSCSI (d, u, v)]
• c: CPU (default)
• m: memory
• n: network
• p: power management
• i: interrupts
• d: disk adapter
• u: disk device
• v: disk VM
![Page 11: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/11.jpg)
esxtop disk adapter screen (d)
Host bus adapters (HBAs) - includes SCSI, iSCSI, RAID, and FC-HBA adapters
Latency stats from the Device, Kernel and the Guest
DAVG/cmd - Average latency (ms) from the Device (LUN)
KAVG/cmd - Average latency (ms) in the VMKernel
GAVG/cmd - Average latency (ms) in the Guest
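These per-command latencies can also be captured in esxtop batch mode (for example `esxtop -b -d 2 -n 30 > capture.csv`) and averaged offline. The sketch below assumes the batch CSV uses column names ending in "MilliSec/Command" for the latency counters; that naming, the host name `esx01` and the sample rows are assumptions for illustration, so check them against a real capture before relying on this.

```python
import csv
import io
from statistics import mean


def average_latency_columns(csv_file, suffix="MilliSec/Command"):
    """Average every CSV column whose header ends with the given suffix."""
    samples = {}
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column and column.endswith(suffix):
                try:
                    samples.setdefault(column, []).append(float(value))
                except (TypeError, ValueError):
                    continue  # skip blank or non-numeric cells
    return {column: mean(values) for column, values in samples.items()}


# Hypothetical two-sample capture for one adapter (column names are assumed).
demo_lines = [
    r'"Time","\\esx01\Physical Disk Adapter(vmhba1)\Average Driver MilliSec/Command","\\esx01\Physical Disk Adapter(vmhba1)\Average Kernel MilliSec/Command"',
    '"01/31/2017 10:00:00","12.0","0.5"',
    '"01/31/2017 10:00:02","14.0","1.5"',
]

for column, avg in average_latency_columns(io.StringIO("\n".join(demo_lines))).items():
    print(f"{avg:6.2f} ms  {column}")
```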
![Page 12: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/12.jpg)
Guest Level Issues
Questions to ask
– Is in-guest / application latency higher than GAVG / vCenter latency?
– Are latency and IOPS low, but performance is still “BAD”?
• Guest-level queue
• Guest app tuning / threads / outstanding I/Os
• For very high IOPS levels: use multiple vSCSI controllers / disks
• Guest-level drivers: PVSCSI; investigate interrupt coalescing
• Alignment
• Filesystem optimizations: fragmentation, sync/async, …
![Page 13: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/13.jpg)
ESXi Level Issues
Device Queue Overflow
World Queue Limiting
High %SYS/Chargeback or VMWAIT
– Blocked Waiting on I/O
– Blocked Waiting on Swapping
High Failed Disk IOPs
SIOC Kicked in – Latency Threshold
VM IOP Limit Set
Questions to ask
– Is KAVG > 1 ms?
– Is the device queue full?
– Is ESX host CPU > 85%?
– Is VM %SYS > 35%?
– Is VMWAIT > 5%?
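The ESXi-level questions can be written down as a simple checklist over one set of readings. This is only a sketch: the `HostSample` fields are hypothetical names, and treating "device queue full" as active commands having reached the device queue length is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    """One set of readings for a host and VM (field names are hypothetical)."""
    kavg_ms: float
    device_queue_len: int  # DQLEN
    active_cmds: int       # commands currently in flight to the device
    host_cpu_pct: float
    vm_sys_pct: float
    vm_wait_pct: float


def esxi_level_flags(s: HostSample):
    """Return the ESXi-level questions that this sample answers with 'yes'."""
    checks = [
        ("KAVG > 1 ms", s.kavg_ms > 1.0),
        ("Device queue full", s.active_cmds >= s.device_queue_len),
        ("ESX host CPU > 85%", s.host_cpu_pct > 85.0),
        ("VM %SYS > 35%", s.vm_sys_pct > 35.0),
        ("VMWAIT > 5%", s.vm_wait_pct > 5.0),
    ]
    return [label for label, hit in checks if hit]


sample = HostSample(kavg_ms=4.2, device_queue_len=32, active_cmds=32,
                    host_cpu_pct=40.0, vm_sys_pct=10.0, vm_wait_pct=1.0)
print(esxi_level_flags(sample))
```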
![Page 14: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/14.jpg)
Array Level Issues
Engage your storage partner to assist in diagnosis
Questions to ask
– Is DAVG > 20 ms?
– What is array health & utilization?
– What is the array reporting for service times?
![Page 15: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/15.jpg)
Device Queue Full
KAVG is non-zero: queuing issue
LUN queue depth is 32
32 I/Os in flight and 32 queued
![Page 16: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/16.jpg)
Disk I/O Queuing – World Queue
World ID
World Queue Length – modifiable (Disk.SchedNumReqOutstanding)
![Page 17: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/17.jpg)
Background: CPU State Times
[Diagram: elapsed time split into CPU states: RUN, RDY, CSTP, MLMTD, WAIT, IDLE, VMWAIT and SWPWT (blocked); callout: Guest I/O]
Chargeback : %SYS time
CPU frequency Scaling: Turbo boost USED > (RUN – SYS)
Power management USED < (RUN – SYS)
![Page 18: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/18.jpg)
Identifying storage connectivity issues
NFS Connectivity Issue (1 of 2)
– I/O activity to NFS datastore
– System time charged for NFS activity
![Page 19: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/19.jpg)
Identifying storage connectivity issues
NFS Connectivity Issue (2 of 2)
– VM blocked, connectivity lost to NFS datastore
– No I/O activity on the NFS datastore
– VM is not using CPU
![Page 20: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/20.jpg)
Performance Impact of Swapping
Some swapping activity
Time spent in blocked state due to swapping
![Page 21: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/21.jpg)
Storage – Recommendations
• Use Multiple vSCSI Adapters
Allows for more queues and I/Os in flight
• Use the PVSCSI vSCSI Adapter
More efficient I/Os per cycle
• Don’t Use RDMs
Unless needed for shared disk clustering; no longer a performance advantage
• Leverage Your Storage OEM’s Integration Guide
They provide necessary guidance around items like multi-pathing; 80% of issues are solved here
![Page 22: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/22.jpg)
“VMDK on VMFS” or “RDM”
• There really is no difference in performance between VMDK on VMFS and RDM
• https://blogs.vmware.com/vsphere/2013/01/vsphere-5-1-vmdk-versus-rdm.html
• Use RDMs ONLY when you require shared disk clustering (or native SAN tools)
![Page 23: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/23.jpg)
“Thick” vs “Thin”
[Chart: I/O throughput (MBs)]
• Thin (fully inflated and zeroed) disk performance = Thick Eager Zero disk performance
• Performance impact is due to zeroing, not the allocation of new blocks
• To get maximum performance from the start, you must use Thick Eager Zero disks (think Business Critical Apps)
• Maximum performance happens eventually, but with lazy zeroing, zeroing needs to occur before you can get maximum performance
http://www.vmware.com/pdf/vsp_4_thinprov_perf.pdf
![Page 24: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/24.jpg)
Iometer
is an I/O subsystem measurement and characterization tool for single and clustered systems.
Windows and Linux
Free (Open Source)
Single or Multi-server capable
Multi-threaded
Metrics Collected
• Total I/Os per Sec.
• Throughput (MB)
• CPU Utilization
• Latency (avg. & max)
![Page 25: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/25.jpg)
I/O Analyzer
is a virtual appliance solution, which provides a simple and standardized way of measuring storage performance.
http://labs.vmware.com/flings/io-analyzer
Readily deployable virtual appliance
Easy configuration and launch of I/O tests on one or more hosts
I/O trace replay as an additional workload generator
Ability to upload I/O traces for automatic extraction of vital metrics
Graphical visualization
![Page 26: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/26.jpg)
Storage Profiling Tips and Tricks
Common I/O profiles (database, web, etc.): http://blogs.msdn.com/b/tvoellm/archive/2009/05/07/useful-io-profiles-for-simulating-various-workloads.aspx
Make Sure to Check / Try:
– Load balancing / multi-pathing
– Queue depth & outstanding I/Os
– pvSCSI Device Driver
Look out for:
– I/O contention
– Disk Shares
– SIOC & SDRS
– IOP Limits
![Page 27: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/27.jpg)
vscsiStats – DEEP Storage Diagnostics
vscsiStats characterizes I/O for each virtual disk
• Allows us to separate out each different type of workload into its own container and observe trends
Histograms are only collected if enabled; no overhead otherwise
Metrics
I/O Size
Seek Distance
Outstanding I/Os
I/O Interarrival Times
Latency
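To make the idea of a per-disk histogram concrete, the sketch below buckets a list of I/O sizes. It is not vscsiStats itself: the bucket limits are hypothetical and are not claimed to match the tool's real boundaries.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper limits in bytes (not the tool's actual buckets).
BUCKET_LIMITS = [512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]


def io_size_histogram(sizes_bytes):
    """Count I/Os per size bucket; each bucket is labelled by its upper limit."""
    hist = Counter()
    for size in sizes_bytes:
        idx = bisect_left(BUCKET_LIMITS, size)
        if idx < len(BUCKET_LIMITS):
            label = f"<= {BUCKET_LIMITS[idx]}"
        else:
            label = f"> {BUCKET_LIMITS[-1]}"
        hist[label] += 1
    return dict(hist)


# Hypothetical workload: mostly 4 KiB I/Os with a few larger ones.
print(io_size_histogram([4096, 4096, 8192, 100000]))
```

The same bucketing approach applies to the other metrics listed, such as latency or outstanding I/Os.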
![Page 28: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/28.jpg)
Steve’s 4-dimensional framework for Latency discussions
CONFIDENTIAL 28
• Magnitude
How high are the spikes, when they happen?
• Frequency
How often / what times / days do the spikes occur?
• Duration
How long do they last, when they occur?
• Spread
How many hosts / datastores are involved?
![Page 29: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/29.jpg)
Magnitude – It Matters
Image Credit: http://housepetscomic.wikia.com/wiki/File:Order_Of_Magnitude.png
![Page 30: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/30.jpg)
Magnitude – How High?
• Magnitude - minor
– Spikes of 30-40-50 (or even 100) milliseconds, could be IOPS exceeding the underlying capacity of the hardware
– This level of magnitude will possibly cause small queues to develop
– Depending on the duration, this might cause applications to feel pain, but not intolerable – a “dull ache” periodically
• Magnitude - intermediate
– When the spikes get up towards 500 milliseconds or greater, not likely an IOPS issue
– This is approximately 50x as long as we would normally expect for each command to complete (i.e. 50 x 10 ms = 500 ms)
– Queues will most certainly develop, and the queues may get sufficiently long that the workload perceives it as an outage
• Magnitude - major
– If single SCSI commands take more than 1000 milliseconds (1 second), to execute, there are serious issues indeed
– Queues will almost certainly get sufficiently long that workloads will perceive storage is unavailable
– In both the intermediate and major cases here, duration and spread must be considered
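The three magnitude bands can be expressed as a small classifier. This is a sketch under stated assumptions: the slide gives no exact cut-over between minor and intermediate, so 500 ms is used here, and anything under 30 ms is reported as falling below the bands.

```python
def classify_magnitude(latency_ms: float) -> str:
    """Map a latency spike (ms) onto the slide's magnitude bands.

    The 30 ms floor and the 500 ms minor/intermediate boundary are
    assumptions made for this sketch.
    """
    if latency_ms > 1000:
        return "major"
    if latency_ms >= 500:
        return "intermediate"
    if latency_ms >= 30:
        return "minor"
    return "below the bands on this slide"


for spike_ms in (45, 600, 1500):
    print(spike_ms, "ms ->", classify_magnitude(spike_ms))
```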
![Page 31: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/31.jpg)
Frequency – Variations in patterns
Image Credit: https://www.sfu.ca/~truax/Frequency_Modulation.html
![Page 32: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/32.jpg)
Frequency – How often / What days / times?
• Frequency - occasional
– Once in a while, no set pattern in terms of day of week, time of day
– Also seemingly “random” with regard to datastores, hosts
– Not consistent over time, appears to come and go
• Frequency – some patterns
– Sometimes we see certain time frames of the day; i.e. middle of the night
– These time slots are usually reserved for maintenance type activity
– A “flood” of activity that is much more than the environment was engineered for, can be the cause
• Frequency – all over the place
– In this scenario, we see events logged all through the day / night, and multiple days of week, weeks of month
– Workloads may perceive storage is unavailable (because of excessive queuing)
– In both the intermediate and major cases here, duration and spread must be considered.
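One way to look for time-of-day patterns is to count logged events per hour. The sketch below assumes timestamps in the form `YYYY-MM-DDTHH:MM:SS.mmmZ`; the sample values are invented for illustration.

```python
from collections import Counter
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed log timestamp layout


def events_by_hour(timestamps):
    """Count events per hour of day (0-23) to expose time-of-day patterns."""
    hours = Counter(
        datetime.strptime(ts, TIMESTAMP_FORMAT).hour for ts in timestamps
    )
    return dict(sorted(hours.items()))


# Hypothetical events: two in the 02:00 hour, one mid-afternoon.
sample_events = [
    "2017-01-01T02:10:00.000Z",
    "2017-01-01T02:45:10.500Z",
    "2017-01-02T14:00:00.000Z",
]
print(events_by_hour(sample_events))
```

A cluster in the middle of the night would fit the "some patterns" case, while counts spread across all hours would fit "all over the place".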
![Page 33: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/33.jpg)
Duration – How Long … ?
Image Credit: http://www.jqueryscript.net/images/Easy-Time-Duration-Picker-Plugin-with-jQuery-jQuery-UI.jpg
![Page 34: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/34.jpg)
Duration – How Long Do the Spikes last?
• Duration – a “blip”
– Durations of a few milliseconds, or even up to a second, are not necessarily material (unless the Magnitude is high)
– Generally don’t last long enough to cause queuing
– Consider, however, the frequency – if they are all consecutive, even short duration “blips” can add up to longer periods
• Duration – moderate in length
– Generally these are greater than 1 second, but likely no longer than a minute or two
– Cause may be some sort of queue development and clearing
– Here, other factors such as frequency are relevant – if they happen too frequently, the effect can be much worse
• Duration - elongated
– Depending on the Magnitude, if spikes go on and on for many seconds, the effect can be cumulative
– If the spike lasts for minutes, and the Magnitude is sufficiently high, workloads may perceive “outages”
– Again, most important to consider this factor together with Magnitude and Frequency
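The point that consecutive "blips" can add up to a longer episode can be illustrated by merging spikes that occur close together. This is a minimal sketch; the 1-second gap used to decide that two spikes belong together is an assumption.

```python
def merge_spikes(spikes, max_gap_s=1.0):
    """Merge (start_s, duration_s) spikes separated by at most max_gap_s.

    Returns a list of (start_s, end_s) episodes.
    """
    episodes = []
    for start, duration in sorted(spikes):
        end = start + duration
        if episodes and start - episodes[-1][1] <= max_gap_s:
            prev_start, prev_end = episodes[-1]
            episodes[-1] = (prev_start, max(prev_end, end))
        else:
            episodes.append((start, end))
    return episodes


# Hypothetical data: three half-second blips back to back, then one a minute later.
blips = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5), (60.0, 0.5)]
for start, end in merge_spikes(blips):
    print(f"episode from {start:.1f}s to {end:.1f}s ({end - start:.1f}s long)")
```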
![Page 35: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/35.jpg)
Spread – Hosts / Datastores
Image Credit: https://www.wired.com/wp-content/uploads/2015/04/epi-rail-web1-1024x1024.jpg
![Page 36: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/36.jpg)
Spread – Confined vs Widespread (?)
• Spread – confined
– If the issue is on a single host (or a small subset of the total hosts), that suggests an inquiry line
– Often it can be limited to a single cluster
– Same inference for single datastore, or small subset of datastores
• Spread – intermediate
– Multiple hosts, multiple clusters, multiple datacenter objects
– More than one array type, and/or significant representation of datastores
– This suggests more of a fabric issue, especially if multiple arrays involved
• Spread – widespread / universal
– Almost all hosts involved, many clusters and/or datacenters
– Most datastores involved also
– This may be a combination of fabric and array issues
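Spread can be quantified by counting the distinct hosts and datastores that appear in the logged events. The sketch below only reports counts and fractions; it deliberately does not assign the confined / intermediate / widespread labels, since the slide gives no numeric cut-offs. Host and datastore names are invented.

```python
def spread_summary(events, total_hosts, total_datastores):
    """Summarize how many hosts and datastores appear in (host, datastore) events."""
    hosts = {host for host, _ in events}
    datastores = {datastore for _, datastore in events}
    return {
        "hosts_affected": len(hosts),
        "host_fraction": len(hosts) / total_hosts,
        "datastores_affected": len(datastores),
        "datastore_fraction": len(datastores) / total_datastores,
    }


# Hypothetical environment: 4 hosts, 10 datastores, 3 logged events.
logged = [("esx01", "ds1"), ("esx01", "ds2"), ("esx02", "ds1")]
print(spread_summary(logged, total_hosts=4, total_datastores=10))
```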
![Page 37: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/37.jpg)
Latency - Symptoms
Host logs: /var/run/log/vobd.log
Magnitude over half a second
Duration 39 seconds
![Page 38: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/38.jpg)
Latency - Evidence
Parsed from logs / imported to Excel for analysis
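A parsing step like this could be sketched as follows. The regular expression assumes latency events are logged with the wording "performance has deteriorated. I/O latency increased from average value of N microseconds to M microseconds"; the sample line, device name and timestamp are invented, so verify the pattern against your own vobd.log before using it.

```python
import csv
import re
import sys

# Assumed wording of a latency event line; adjust to match real log content.
LATENCY_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T[\d:.]+Z).*?"
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<average_us>\d+) microseconds "
    r"to (?P<current_us>\d+) microseconds"
)
FIELDS = ["timestamp", "device", "average_us", "current_us"]


def parse_latency_events(lines):
    """Extract timestamp, device and latencies (microseconds) from log lines."""
    events = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            events.append({
                "timestamp": match["timestamp"],
                "device": match["device"],
                "average_us": int(match["average_us"]),
                "current_us": int(match["current_us"]),
            })
    return events


# Hypothetical log content: one latency event and one unrelated line.
SAMPLE = [
    "2017-01-01T00:00:00.000Z: [scsiCorrelator] 1234567us: "
    "[esx.problem.scsi.device.io.latency.high] Device naa.0000000000000001 "
    "performance has deteriorated. I/O latency increased from average value "
    "of 1343 microseconds to 528022 microseconds.",
    "2017-01-01T00:00:01.000Z: unrelated line",
]

# Write the parsed events as CSV, ready to import into a spreadsheet.
writer = csv.DictWriter(sys.stdout, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(parse_latency_events(SAMPLE))
```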
![Page 39: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/39.jpg)
But where do we look?
DAVG: Host HBA → Driver → Firmware → Wire → Switch → Wire → Array Front End → LUN → Media ... and return!
![Page 40: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/40.jpg)
Strategy / Approach
• Collect data from the hosts
– Based on vm-support log extracts
– Objective: Understand the Magnitude, Frequency, Duration and Spread
– Get everyone on the same page regarding the symptoms
• Share the data collaboratively
– Both storage and fabric support teams
– Get the correlating data from the array stats
– Compare with the hosts’ experience
• Does the array data agree with the host experience?
– If so, then array support / vendor can investigate / make changes
– If not, then issue must be in the fabric, so different direction for the investigation
– After any changes are made, collect fresh logs and perform comparative analysis
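The comparison between host and array data can be written as a one-line decision. This is only a sketch of the reasoning above: the 5 ms tolerance for deciding that the two views "agree" is an arbitrary assumption, and real comparisons need matching time windows on both sides.

```python
def next_investigation_step(host_latency_ms, array_latency_ms, tolerance_ms=5.0):
    """Suggest where to look next based on whether host and array data agree.

    tolerance_ms is a hypothetical margin for treating the two views as equal.
    """
    if abs(host_latency_ms - array_latency_ms) <= tolerance_ms:
        return "array"   # the array sees the same latency as the hosts
    return "fabric"      # the latency is added between host and array


# Hypothetical readings for the same time window.
print(next_investigation_step(host_latency_ms=520.0, array_latency_ms=518.0))
print(next_investigation_step(host_latency_ms=520.0, array_latency_ms=4.0))
```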
![Page 41: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/41.jpg)
Configuration issues – many possibilities
• Round Robin PSP
– IOPS=1000 vs IOPS=1
• Fabric
– Switches (hardware / field upgradeable code such as firmware)
– Cable plant (defective or inferior quality cables / connectors)
– Zoning issues
• HBA issues
– Drivers / firmware / hardware issues
– Queue depth settings
• Array issues
– Front end processor issues
– Defective media
– De-duplication and other overhead activity
– High % of cache misses
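For the Round Robin item, the IOPS value is the number of I/Os sent down one path before switching to the next. The toy simulation below shows how that setting changes the distribution of a short burst across paths; it is a simplified model with made-up path names, not a description of how the path selection plugin is implemented.

```python
from collections import Counter
from itertools import cycle


def simulate_round_robin(paths, io_count, iops_limit):
    """Count I/Os per path when switching paths after iops_limit I/Os."""
    counts = Counter({path: 0 for path in paths})
    path_iter = cycle(paths)
    current = next(path_iter)
    sent_on_current = 0
    for _ in range(io_count):
        if sent_on_current == iops_limit:
            current = next(path_iter)  # limit reached: move to the next path
            sent_on_current = 0
        counts[current] += 1
        sent_on_current += 1
    return dict(counts)


# Hypothetical burst of 1500 I/Os over two paths.
print("IOPS=1000:", simulate_round_robin(["pathA", "pathB"], 1500, 1000))
print("IOPS=1:   ", simulate_round_robin(["pathA", "pathB"], 1500, 1))
```

With the larger value the burst lands mostly on one path, while a value of 1 alternates paths on every I/O. Whether a lower value is appropriate depends on the array, so follow the storage vendor's guidance.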
![Page 42: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/42.jpg)
Sample Case # 2 – Unresponsive guests
… and yet, can ping the hosts, no apparent network issues
Can this be a storage issue?
![Page 43: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/43.jpg)
Let’s look in the logs
Host logs: /var/run/log/vobd.log
Between 11:19:05.105Z and 11:21:46.217Z, no I/O scheduled for datastore UUID 4a80
https://kb.vmware.com/kb/2136081 - "Understanding lost access to volume messages in ESXi 5.5/6.x"
![Page 44: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/44.jpg)
And speaking of logs …
• Many people find this painful
– But it is not meant to be
– https://kb.vmware.com/kb/1008524 - "Collecting diagnostic information for VMware products"
– The above KB has links to every product (or should do – please report if not)
– If you have trouble collecting logs, then that’s a reason for an SR all by itself
• Which logs?
– For storage, almost always get vm-supports from ALL hosts in any cluster of interest
– If LUNs are presented to multiple clusters, then ALL hosts in EACH cluster
– Generally vCenter and vSphere client logs can be omitted
– But … it is the responsibility of the investigator (TSE) to prescribe which logs are needed
• Uploading
– https://kb.vmware.com/kb/2069559 - "Uploading diagnostic information for VMware through the Secure FTP portal"
– Make sure to use Binary transmission mode
– Make sure to change into the SR directory (after creating as necessary – directory name is SR #) before transferring files
![Page 45: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/45.jpg)
Bottom Line Principles
• Optimally configured / engineered environments
– Should exhibit few, if any, latency alerts and/or VMFS heartbeat “timedout” events
– Log analysis can be done anytime, if problems are suspected
– Also, can be done when problems are NOT suspected – provides useful baseline info
– Better to cite evidence, than to throw darts
• Collaboration is key
– The root cause(s) of these issues are usually external to vSphere, BUT …
– ESXi log analysis can help direct the investigation, AND …
– Storage and fabric support teams are needed in addition to vSphere admins, AND …
– Vendors need to get engaged also
• It’s in everyone’s interest that things are smooth and stable
– Often, if these issues are chronic, word starts spreading that virtualized apps “can’t keep up” with physicals
– In most cases, that is no longer true
– And even if it is in some cases – we want to fix that!
![Page 46: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/46.jpg)
Community Resources
VMware’s Performance Technology Pages (Whitepapers Here)
– http://www.vmware.com/technical-resources/performance/resources.html
VMware’s Tech-Marketing Performance Blog
– http://blogs.vmware.com/vsphere/performance/
VMware’s Perf-Eng Blog (VROOM!)
– http://blogs.vmware.com/performance
Performance Community Forum
– http://communities.vmware.com/community/vmtn/general/performance
VMware Performance Links – Master List
– https://communities.vmware.com/docs/DOC-25253
Virtualizing Business Critical Applications
– http://www.vmware.com/solutions/business-critical-apps/
![Page 47: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/47.jpg)
Resources
VMware’s Performance – Technical Whitepapers
http://www.vmware.com/resources/techresources/cat/91,96
![Page 48: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/48.jpg)
Resources
Performance Best Practices
http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.5.pdf
http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmware-perfbest-practices-vsphere6-0-white-paper.pdf
Troubleshooting Performance Related Problems in vSphere Environments
http://communities.vmware.com/docs/DOC-23094 (vSphere 5.x with vCOps)
![Page 49: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/49.jpg)
Resources
Virtualizing Microsoft Business Critical Applications on VMware vSphere
by: Matt Liebowitz, Alexander Fontana
vSphere High Performance Cookbook
by: Prasenjit Sarkar
Troubleshooting Storage Performance
By: Mike Preston
VMware vSphere Performance: Designing CPU, Memory, Storage, and Networking for Performance-Intensive Workloads
By: Matt Liebowitz, Christopher Kusek, Rynardt Spies
Virtualizing SQL Server with VMware: Doing IT Right
By: Jeff Szastak, Michael Corey, Michael Webster
Virtualizing Oracle Databases on vSphere
By: Don Sullivan, Kannan Mani
VMware vRealize Operations Performance and Capacity Management
By: Ewan ‘e1’ Rahabok
![Page 50: Troubleshooting Storage Performance - WordPress.com · Agenda • The ESXi Storage Stack • Troubleshooting Performance • Recommended Practices, Tools and Tips • Steve’s 4-dimensional](https://reader034.vdocuments.mx/reader034/viewer/2022042612/5f754863aa7f752875426b5b/html5/thumbnails/50.jpg)
Resources
VMware Hands-On-Labs
http://labs.hol.vmware.com/
HOL-SDC-1404:
vSphere Performance Optimization – This has always been one of the most popular labs and has content for both the beginner and the advanced vSphere Administrator. You can learn more about the basics of vSphere Performance or delve into esxtop, or vNUMA.
http://labs.hol.vmware.com/HOL/#lab/1474