Francis Daly & Javier Menendez
STO1315BU
#VMworld #STO1315BU
vSAN Troubleshooting Deep Dive
VMworld 2017 Content: Not for publication or distribution
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
#STO1315BU CONFIDENTIAL 2
Agenda
1 Using common sense
2 vSAN Overview
3 vSAN Tools
4 Health
5 Use cases
6 Questions
Plan Your Build
Ensure You Have Backups (That Have Been Tested)
vSAN Cluster – The World is Your Oyster
[Diagram: Hosts 1 to 64 connected over 10GbE. Each host holds disk groups 1 to 5; each disk group has one SSD cache device and capacity devices 1 to 7, each SSD or HDD.]
• Cluster: 2-64 physical hosts
• Host: 1-5 disk groups
• Disk Group:
– 1 flash device for cache
– 1-7 flash or HDD devices for capacity
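The limits above can be turned into a quick sanity check before a build. This is a minimal sketch, assuming only the numbers on this slide; the `DiskGroup` type and `validate_cluster` function are hypothetical names for illustration, not part of any VMware tool.

```python
from dataclasses import dataclass
from typing import List

# Limits taken from the slide above; other vSAN releases may differ.
MIN_HOSTS, MAX_HOSTS = 2, 64
MAX_DISK_GROUPS_PER_HOST = 5
MAX_CAPACITY_DEVICES_PER_GROUP = 7


@dataclass
class DiskGroup:  # hypothetical type, for illustration only
    cache_devices: int
    capacity_devices: int


def validate_cluster(hosts: List[List[DiskGroup]]) -> List[str]:
    """Return a list of limit violations; an empty list means the layout fits."""
    problems = []
    if not MIN_HOSTS <= len(hosts) <= MAX_HOSTS:
        problems.append(f"{len(hosts)} hosts (expected {MIN_HOSTS}-{MAX_HOSTS})")
    for h, groups in enumerate(hosts, start=1):
        if not 1 <= len(groups) <= MAX_DISK_GROUPS_PER_HOST:
            problems.append(
                f"host {h}: {len(groups)} disk groups "
                f"(expected 1-{MAX_DISK_GROUPS_PER_HOST})"
            )
        for g, group in enumerate(groups, start=1):
            if group.cache_devices != 1:
                problems.append(f"host {h} group {g}: needs exactly 1 cache device")
            if not 1 <= group.capacity_devices <= MAX_CAPACITY_DEVICES_PER_GROUP:
                problems.append(
                    f"host {h} group {g}: {group.capacity_devices} capacity devices "
                    f"(expected 1-{MAX_CAPACITY_DEVICES_PER_GROUP})"
                )
    return problems
```

Each host is passed as a list of its disk groups, so a three-host cluster with one disk group per host is `[[DiskGroup(1, 4)], [DiskGroup(1, 4)], [DiskGroup(1, 4)]]`.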
Important Considerations
• Ensure all hardware is supported, to avoid performance degradation
– Disks, controllers, firmware, drivers
• Objects on the vSAN datastore:
– VM Home, VM Swap, VMDK
– Delta Disk, Memory Delta
[Diagram: vSphere with vSAN presenting a single vSAN datastore]
What if I Choose Incorrect Policy?
• Policies are applied per VM or per VMDK
• Policies define the protection level and performance
vSAN Objects and Components
• The vSAN datastore is an object store
• Each object made up of one or more components
• Policy will determine how components are distributed across cluster
[Diagram: Keeper_VM 200GB object, FTT=1, laid out as RAID-1 over RAID-0 with components C1 and C2 and a witness (W)]
vSAN Objects and Components
• The vSAN datastore is an object store
• Object store allows you to meet granular availability and performance requirements
• Each object made up of one or more components
• Data (components) is distributed across cluster based on VM storage policy
• RAID-1 requires 2n+1 hosts to tolerate n failures
• More than 50% of an object's components must be available for the object to remain active (quorum)
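The two rules on this slide reduce to simple arithmetic. The sketch below assumes each component carries one vote, which is a simplification; the function names are hypothetical.

```python
def min_hosts_for_raid1(failures_to_tolerate: int) -> int:
    """RAID-1 needs 2n+1 hosts to tolerate n failures."""
    return 2 * failures_to_tolerate + 1


def has_quorum(available_votes: int, total_votes: int) -> bool:
    """An object stays active only with strictly more than 50% of its votes."""
    return 2 * available_votes > total_votes
```

With FTT=1 this gives 3 hosts: two data replicas plus a witness. Losing any one of the three leaves 2 of 3 votes, which is still a quorum; an even split such as 1 of 2 is not.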
Network Partition
• Witness or votes are used as a tie breaker if 2n+1 is not satisfied
vSAN Object Component States
[Diagram: FTT=1 object laid out as RAID-1 over RAID-0, with components C1, C2, C3 in each replica and a witness (W)]
• Active. Component accessible
• Absent. Inaccessible, but no explicit error codes sensed
– Ex: Host outage, or EMM with “ensure accessibility”
– Rebuild begins after 60 minute timeout window
• Degraded. Inaccessible, with error codes sensed
– Ex: Device failure
– Rebuild begins immediately.
• State can be checked in the UI, CLI, RVC, and more
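The difference between the states comes down to when a rebuild starts. This is a minimal sketch of that rule using the 60 minute default from the slide; the enum and function are hypothetical names, and the delay is configurable in a real cluster.

```python
from enum import Enum
from typing import Optional


class ComponentState(Enum):
    ACTIVE = "active"      # component accessible
    ABSENT = "absent"      # inaccessible, no explicit error codes sensed
    DEGRADED = "degraded"  # inaccessible, error codes sensed


# Default timeout stated on the slide; assumed unchanged here.
ABSENT_REPAIR_DELAY_MINUTES = 60


def rebuild_delay_minutes(state: ComponentState) -> Optional[int]:
    """Minutes before a rebuild starts, or None if no rebuild is needed."""
    if state is ComponentState.ACTIVE:
        return None
    if state is ComponentState.ABSENT:
        return ABSENT_REPAIR_DELAY_MINUTES
    return 0  # DEGRADED: rebuild begins immediately
```

The delay for absent components exists because a host outage or maintenance mode is often temporary, so waiting avoids an unnecessary full resync.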
Tools
Useful Tools for Troubleshooting
• Health Check
• RVC
• ESXCLI
• vRealize Ops
• vSAN Observer
ESXCLI
VC Unavailable – Now What?
Health Cluster List
Investigating Yellow State
• Threshold is 30%
• Disk A = 25%, Disk B = 60%, Delta = 35%
• Rebalance required
• Consider performance impact
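The yellow state in this example follows from the usage delta between the fullest and emptiest disk. A minimal sketch of that arithmetic, assuming the delta must strictly exceed the threshold; the function name is hypothetical.

```python
from typing import Dict, Tuple


def needs_rebalance(usage_percent: Dict[str, float],
                    threshold: float = 30.0) -> Tuple[float, bool]:
    """Return (delta, rebalance_required) for per-disk usage percentages."""
    delta = max(usage_percent.values()) - min(usage_percent.values())
    # Assumption: a rebalance is flagged only when the delta exceeds the threshold.
    return delta, delta > threshold
```

For the slide's numbers, `needs_rebalance({"Disk A": 25, "Disk B": 60})` gives a delta of 35, which is above 30, so a rebalance is required.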
Where are the Most Serious Issues in my Cluster?
What is an object? VMDK, vswp, etc.
So What Happened?
Diving Deeper
Fundamentals of vSAN Resyncs
Resync is like DNA Replication!
Resync Summary
Where else can I see this?
• vSphere Web Client
• RVC: Resync Dashboard
• Log Insight
• vSAN Observer
Resync Summary
Do not:
• Reboot hosts or enter maintenance mode
• Remove hosts or disks
Do:
• Configure hosts to reboot after PSOD
• Increase the default clomd repair time
vSAN Debug List
Are my disks keeping up?
• Network: VM IO workload exceeds bandwidth
• Resync: throttling will prioritize VM IOs
Performance Bottlenecks
Fast disks but still seeing large queues and failed IOs
• Manage expectations
• Controller queue depth
• SSD grade
• Network setup
RVC
vsan.check_state
• What state is my cluster in?
• I want to quickly know if there is anything serious happening in my cluster
vsan.disks_stats 0
KB 2145267 – Understand vSAN on-disk format
vsan.cluster_info
Health UI
Cluster Level Quick Peek
CEIP will need to be enabled
Issues at Cluster Level
vSAN Disk Balance
What to Do if Rebalance is Required?
vSAN Disk Balance
What Happens During a Rebalance?
Rebalance in Action
Post Rebalance
Health Service CLI
vCenter vSAN Health Status Script
• Provides a text output for all the Health Tests
• Might be useful to run on the vCenter as the log bundles are being collected
• Manually copy off the text file
python /usr/lib/vmware-vpx/vsan-health/vsan-vc-health-status.py > /tmp/vsan-vc-health-status.txt
vCenter vSAN Health Status Script
What else does it do for me?
Runs the following RVC commands and collects their output:
• vsan.cluster_info on each cluster
• vsan.host_info for each host in every cluster
• vsan.vm_object_info on a host-per-host basis
• vsan.disks_info for each host on a per-cluster basis
• vsan.disks_stats
• vsan.check_limits
• vsan.check_state
• vsan.lldpnetmap
• vsan.obj_status_report
• vsan.resync_dashboard
• vsan.disk_object_info
How much time would that save you on a call?
ESXi vSAN Health Status Script
What is the vSAN Health Status script?
• Command-line tool to display the vSAN health for a particular node
Where do I find it?
• On each host individually, not vCenter
• /usr/lib/vmware/vsan/bin/vsan-health-status.pyc
How do I run it?
• [root@esxi] python /usr/lib/vmware/vsan/bin/vsan-health-status.pyc
See KB 2107705
When to Use it
When vCenter is down:
• The Health Service on vCenter is not available
• You can check the health of individual nodes and their components
• Output is quite different from the vCenter version
ESXi vSAN Health Status script – What Will I Find in the Output
vSAN HCL related hardware info
• Displays key information about controllers to check against HCL
Limits summary
• How many components vs limit on the host
• How much space is consumed by components
Network summary
• Subnet used by vSAN
• Multicast addresses
Physical vSAN disk summary
• Info about physical disks, including CMMDS UUIDs
Missing Disks
I Added Hosts to my Cluster, Not Seeing Space
Which Tools/Logs Do I Need
Track down missing MDs (magnetic disks)
What is vobd.log?
What is boot.log?
What happens in CMMDS?
esxcli vsan storage list – check whether each disk is "In CMMDS"
esxcli vsan storage list
What to Do Next
Take the naa ID of a disk that is not "In CMMDS"
cd /var/run/log
less vobd.log
/"Disk UUID 12345678"
Disk not found
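When the same disk ID has to be checked across several logs, the manual `less` search can be scripted. This is a minimal sketch of a plain substring search; `find_mentions` is a hypothetical helper, and it makes no assumptions about the log line format.

```python
from typing import Iterable, List, Tuple


def find_mentions(lines: Iterable[str], needle: str) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that contains the needle."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]
```

On a host it could be used as `with open("/var/run/log/vobd.log") as log: hits = find_mentions(log, "Disk UUID 12345678")`, where the UUID is the placeholder from the slide. An empty result corresponds to the "not found" outcome and suggests moving on to the next log.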
What to Do Next
cd /var/log
cat boot.log | less
/"Disk UUID 12345678"
LVM: 8355:Device naa."Disk UUID 12345678" detected to be a snapshot:
What Happened and Resolution
The disk cannot be added to the cluster because there is a file system on it
The disk had been given a UUID: it was in a cluster and had been used at some point
Verified with the customer that we could delete the data
Used partedUtil to delete the partitions
Deleted the disk groups and created new disk groups
Poor VM performance
Some VMs on the vSAN datastore are performing poorly, which has resulted in effective data unavailability. Performance improved after some vSAN disks entered a failed state.
What Happened to the Disk?
grep failed vmkernel.log | less
Check the sense codes with a SCSI sense code decoder
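Decoding the sense key is the first step a SCSI decoder performs. The sketch below maps the standard SCSI sense key values to their names. It assumes the log line carries the three hex values after the text `sense data:`, as in the illustrative sample in the usage note; that format, the regular expression, and the function name are assumptions, and the ASC/ASCQ pair is returned undecoded.

```python
import re
from typing import Optional, Tuple

# Standard SCSI sense key names.
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC", 0xA: "COPY ABORTED", 0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW", 0xE: "MISCOMPARE",
}

# Assumed layout: "sense data: 0x<key> 0x<asc> 0x<ascq>".
_SENSE_RE = re.compile(
    r"sense data:\s*0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)"
)


def decode_sense(line: str) -> Optional[Tuple[str, int, int]]:
    """Return (sense key name, ASC, ASCQ), or None if no sense data is found."""
    match = _SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups())
    return SENSE_KEYS.get(key, f"UNKNOWN (0x{key:x})"), asc, ascq
```

For an illustrative line ending in `Valid sense data: 0x3 0x11 0x0`, the sense key decodes to MEDIUM ERROR, which points at the device rather than the path. The ASC/ASCQ values still need a full decoder table.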
What Else Happened? Controllers
Host ID
Sense Code Output
What Happened?
Three disks failed in a short space of time
Controllers were aborting
What would you do?
VMware vSAN Training
Training
VMware vSAN: Deploy and Manage [V6.6]
• Classroom
• Live Online
• Onsite
More Options
On Demand Courses
• Self-paced learning
• Meets certification requirements
VMware Learning Zone
• vSAN Troubleshooting
• Cloud-based learning
• Supplements traditional training
View complete options and descriptions: www.vmware.com/education
Global Support Services
Learn more about how VMware is radically transforming Customer Support through VMware Skyline™ technology.
• See a demo in the VMware booth in the Solutions Exchange
• Sign up for a Meet the Experts roundtable in the Schedule Builder on the VMworld mobile app or visit the Meet the Experts, Level 2
• Visit www.vmware.com/support/service/skyline
Additional Education Resources
At VMworld 2017
• Education & Certification Lounge: VM Village
• Certification Exam Center: Jasmine EFG, Level 3
Online
• VMware Training: www.vmware.com/education
• VMware Certification: www.vmware.com/certification
Save 50% off VCP & VCAP exams at VMworld 2017
Prove Your Expertise with VMware Digital Badges
What are VMware Digital Badges?
• Single source that combines your credential and a complete overview of your skills
• Way for you to easily share your accomplishments in social media
• Provides employers with easy, valid verification of VMware credentials
Digital Badges Online
• vmware.com/go/vSAN2017badge
• vmware.com/go/VxRail2017badge
NEW! VMware Digital Badges
• VMware vSAN 2017 Specialist
• Dell EMC VMware Co-Skilled - VxRail 2017
Prove your expertise on vSAN and vSAN HCI environments