How to troubleshoot vSAN OSA disk issues

Article ID: 326859


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

A vSAN disk group (DG) responds differently to a disk failure depending on the cluster configuration:

When "Deduplication and Compression" is enabled:

  • Cache disk failure --> The whole disk group will be down.
  • Capacity disk failure --> The whole disk group will be down.

When "Deduplication and Compression" is not enabled:

  • Cache disk failure --> The whole disk group will be down.
  • Capacity disk failure --> Only the failed disk will be down.


Environment

VMware vSAN 6.x
VMware vSAN 7.0.x
VMware vSAN 8.0.x

Resolution

This article addresses the following:

How to identify whether your disk group is using deduplication and compression?

  • From the vCenter Web Client (Cluster --> Monitor --> vSAN --> Capacity), if deduplication is enabled, you will see the "Deduplication and compression savings" ratio.
  • Or from Cluster > Configure > vSAN > Services > Data Services > Space efficiency
  • Or from the ESXi CLI: open an SSH session (e.g., using PuTTY) to any host in the cluster and run the following command:

esxcli vsan storage list | grep -i dedup
   Deduplication: true
   Deduplication: true
   Deduplication: true
   Deduplication: true
   Deduplication: true
   Deduplication: true


How to identify the failed disk?
  • In case deduplication & compression is not enabled, go to vCenter and check "Operation health" under "Skyline Health" [Cluster --> Monitor --> vSAN --> Skyline Health --> Physical disk --> Operation health]
  • In case deduplication & compression is enabled, any disk failure will cause the whole DG to go offline. In this situation, scroll to the right to view the "Operational State Description" column to see which disk has failed; the disk marked "Permanent disk failure" is the failed disk. (A CLI alternative is sketched below.)
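The failed disk can also be identified from the ESXi CLI if vCenter is not available. A minimal sketch (the fields shown are only a subset, and their names and values vary between vSAN versions; the output below is illustrative only):

esxcli vsan debug disk list
   UUID: 52xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
   Name: naa.xxxxxxxx
   SSD: False
   Overall Health: red
   In CMMDS: false

A disk whose health is not green, or which is no longer in CMMDS, is the one to investigate further.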

How to verify the disk status?
  • To confirm whether the disk or disk group is currently mounted or still offline, go to vCenter and check "Disk Management" [Cluster --> Configure --> vSAN --> Disk Management]. A CLI alternative is sketched below.

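The mount state can also be confirmed from the ESXi CLI. A minimal sketch (output trimmed to the relevant fields; a disk that is mounted and participating in the cluster reports "In CMMDS: true"):

esxcli vsan storage list | grep -E "Device:|In CMMDS:"
   Device: naa.xxxxxxxx
   In CMMDS: true
   Device: naa.yyyyyyyy
   In CMMDS: false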


How to identify the physical location of the failed disks?
In case the failed disk is not in "Absent" state:
  • Run the following command using the device identifier of the failed disk (typically naa.xxxx, eui.xxx, or mpx.xxx):

esxcli storage core device physical get -d naa.xxxxx
   Physical Location: enclosure 1, slot 0

 

  • Or click on "TURN ON LED" for the failed disk [Connect to vCenter, click on the host with the failed disk --> Configure --> Storage devices --> select the failed disk --> "TURN ON LED"]
Note: This doesn't always work, as it depends on whether the controller firmware supports this feature.

In case the failed disk is in "Absent" state do one of the following:
  • Follow KB vSAN Deduplication enabled -- Identifying Failed Disk (2149067)
  • If for some reason the above KB doesn't help with identifying the failed disk, you can narrow down its physical slot by elimination: click "TURN ON LED" for each of the working disks, or get the physical slot of each working disk (esxcli storage core device physical get -d "naa.xxxx"). Once the physical locations of all the working disks are known, the remaining slot is the location of the failed disk.
Note: As previously mentioned, "TURN ON LED" doesn't always work, as it depends on the controller firmware supporting this feature. If there are a lot of disks in the server, running the command on every disk can be time consuming unless you script it; a small loop like the one sketched below can automate this.
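A minimal sketch of such a loop, assuming the "Device:" field of esxcli vsan storage list holds the device identifier (adjust the grep/awk if your output differs):

# Print the physical location of every disk claimed by vSAN on this host
for dev in $(esxcli vsan storage list | grep "Device:" | awk '{print $2}'); do
   echo "=== $dev ==="
   esxcli storage core device physical get -d "$dev"
done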

How to respond?
  • First, make sure there are no inaccessible objects. You can check this from vCenter under "Virtual Objects" [Cluster --> Monitor --> vSAN --> Virtual Objects]
You can also check it from the CLI using the following command: esxcli vsan debug object health summary get (sample output below).
In case there are inaccessible objects, open a case with VMware vSAN support.
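For reference, the CLI command returns a per-category object count similar to the following (the exact categories vary by vSAN version; the figures below are illustrative only). Any non-zero count under "inaccessible" means you should stop and engage VMware vSAN support:

esxcli vsan debug object health summary get
Health Status                                     Number Of Objects
-------------------------------------------------------------------
inaccessible                                      0
reduced-availability-with-no-rebuild              0
reduced-availability-with-no-rebuild-delay-timer  2
data-move                                         0
nonavailability-related-incompliance              0
nonavailability-related-reconfig                  0
healthy                                           150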
 
  • Based on the cause of the disk failure, an activity might need to be performed on the host (for example: reboot the host, replace/re-insert the failed disk, or recreate the DG). As a best practice, place the host in maintenance mode with "Ensure accessibility" before proceeding with any activity (Note: run the data evacuation pre-check to understand whether data will need to be migrated).


  • Important note: If the host stays in maintenance mode with "Ensure accessibility" for longer than the default repair time of 60 minutes, vSAN will start a re-sync operation to rebuild the data residing on all DGs on that host. If the activity will take longer than the configured repair time, you can temporarily increase it to avoid an unneeded re-sync operation. (To change the repair time, reference KB Changing the default repair delay time for a host failure in vSAN (2075456).) A CLI sketch for both steps follows.
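Both steps can also be performed from the ESXi CLI. A minimal sketch, assuming the /VSAN/ClomRepairDelay advanced option described in KB 2075456 (on recent versions the "Object repair timer" can instead be changed cluster-wide in the vSphere Client under vSAN Services > Advanced Options):

# Enter maintenance mode with Ensure accessibility (no full data evacuation)
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# Temporarily raise the repair delay, e.g. to 120 minutes, then restart clomd (per KB 2075456)
esxcfg-advcfg -s 120 /VSAN/ClomRepairDelay
/etc/init.d/clomd restart

# Exit maintenance mode once the activity is complete
esxcli system maintenanceMode set --enable false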

How to identify when the disk failed and whether any SCSI errors are reported on the failed disk?
  • Open an SSH/PuTTY session to the ESXi host and run the following command to identify the timestamp when the disk failed:
egrep -i "perm|offline|unhealthy" /var/log/vobd.log (You can also search on the disk UUID)
2022-10-12T20:03:18.694Z: [vSANCorrelator] 27997683071354us: [esx.problem.vob.vsan.lsom.devicerepair] Device xxxxxxxx is in offline state and is getting repaired
2022-10-12T20:46:00.111Z: [vSANCorrelator] 28000195517670us: [vob.vsan.lsom.diskerror] vSAN device xxxxxxxx is under permanent error.

Note: If the disk hasn't failed recently or if vobd is chatty you may need to look at older logs. To do this run zcat /var/run/log/vobd.*.gz|egrep -i "perm|offline|unhealthy"
  • Then run the following command to identify any read/write commands failing at the same time stamp collected from the previous step: 
grep <disk device> /var/log/vmkernel.log
2022-10-12T20:03:14.424Z cpu5:2098330)ScsiDeviceIO: 4325: Cmd(0x45be74dae040) 0x28, CmdSN 0xd65263a8 from world 0 to dev "naa.xxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0
2022-10-12T20:46:00.107Z cpu86:2098331)ScsiDeviceIO: 4277: Cmd(0x45de6e7527c0) 0x28, CmdSN 0xcebe from world 0 to dev "naa.xxxxxxxx" failed H:0xc D:0x0 P:0x0

Note: If the disk hasn't failed recently or if vmkernel is chatty you may need to look at older logs. To do this run zcat /var/run/log/vmkernel.*.gz|grep <disk device>
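To cover both the live and the rotated logs in one pass, something like the following can be used (a sketch; replace the device identifier with that of the failed disk):

DEV="naa.xxxxxxxx"
egrep -i "perm|offline|unhealthy" /var/run/log/vobd.log
zcat /var/run/log/vobd.*.gz 2>/dev/null | egrep -i "perm|offline|unhealthy"
grep "$DEV" /var/run/log/vmkernel.log
zcat /var/run/log/vmkernel.*.gz 2>/dev/null | grep "$DEV"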


Common disk failure reasons:

(1) Disk is soft failed:
 
Troubleshooting steps:

  • Check the KVM (iDRAC, iLO) for any issues with the disks/controller
  • Check the logs for any SCSI error codes
  • Check the controller driver/firmware to ensure they're not down-rev or in an unsupported combination: run esxcli vsan debug controller list -v=true, then check the vSAN HCL; if the driver/firmware are down-rev or not in a supported combination, upgrade them. (Commands that help gather this information are sketched below.)
  • If no issues are found in the KVM and no SCSI error codes are logged, this is a soft fail of the disk, and a reboot of the host may bring the disk(s)/DG back online: place the host into maintenance mode with ensure accessibility, then reboot the host. (If the disk(s)/DG comes back, great; otherwise, engage the hardware vendor for replacement/further assistance.)
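The driver and firmware details needed for the HCL check can be gathered with commands like the following (a sketch; lsi_mr3 is only an example module name, use the driver reported for your controller):

# List the storage adapters and the driver module each one uses
esxcli storage core adapter list

# Show version information for a given driver module
vmkload_mod -s lsi_mr3 | grep -i version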
 

(2) Hardware issue: Valid sense data: 0x4 0x0 0x0

Example of SCSI error from log file: /var/run/log/vmkernel.log:
2021-01-05T08:37:16.337Z cpu26:2098033)ScsiDeviceIO: 3047: Cmd(0x45a3e27a1700) 0x2a, CmdSN 0x2238d from world 2960707 to dev "naa.xxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.


Troubleshooting steps:

  • Sense key 0x4 indicates a hardware error reported by the device; the disk needs to be replaced by the hardware vendor.


(3) Medium error: Valid sense data: 0x3 0x11 0x0

  • Unrecovered Read Error (URE) is a type of medium error that occurs when the ESXi host tries to read from a bad block on the disk (sense key 0x3 indicates a medium error; ASC/ASCQ 0x11/0x0 indicates an unrecovered read error). For more information about URE, please reference the document "Improving vSAN's Resilience against Unrecovered Read Errors on Devices".
  • A URE can occur in the metadata region or the data region of the disk.
  • If the URE occurs in the data region, open a case with VMware vSAN support for further assistance.
  • If the URE occurred in the metadata region: as of ESXi/vSAN 6.7 P03 and 7.0 Update 1 and newer, a feature called autoDG Creation was introduced for All-Flash DGs; vSAN Skyline Health reports the disk as unhealthy, and the bad blocks are reallocated and marked for non-use. See KB vSAN Disk Or Diskgroup Fails With Medium Errors (81121) for more details.

Example of SCSI error from log file: /var/run/log/vmkernel.log:
2022-10-12T19:36:55.253Z cpu11:2098330)ScsiDeviceIO: 4325: Cmd(0x45bea479ec40) 0x28, CmdSN 0xfaf from world 0 to dev "naa.xxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0


Troubleshooting steps:

In case "Hybrid vSAN" is used:

  • Disks are HDDs, then the bad disk will need to be replaced by the Hardware vendor.
(4) vSAN Dying Disk Handling (DDH) feature unmounts the bad disk or reports it unhealthy

 
The DDH feature in vSAN continuously monitors the health of disks and disk groups in order to detect an impending disk failure or a poorly performing disk group. (For more information about DDH, please reference KB Dying Disk Handling (DDH) in vSAN (2148358) and the document "vSAN Degraded Device Handling".)
DDH unmounts the disk or reports it unhealthy in the following situations:

  • High write IO latency on one of the vSAN disks.
  • Maximum log congestion threshold reached on one of the DGs.
  • IMPENDING FAILURE reported on one of the vSAN disks (you can see the SMART health status of the disk using the following command: localcli storage core device smart get -d naa.xxx).
Example:
localcli storage core device smart get -d naa.xxxxx
SMART Data for Disk : naa.xxxxx
Parameter                     Value              Threshold  Worst
----------------------------  -----------------  ---------  -----
Health Status                 IMPENDING FAILURE  N/A        N/A
Media Wearout Indicator       N/A                N/A        N/A
Write Error Count             0                  N/A        N/A
Read Error Count              369                N/A        N/A
Power-on Hours                N/A                N/A        N/A
Power Cycle Count             47                 N/A        N/A
Reallocated Sector Count      N/A                N/A        N/A
Raw Read Error Rate           N/A                N/A        N/A
Drive Temperature             30                 N/A        N/A
Driver Rated Max Temperature  N/A                N/A        N/A
Write Sectors TOT Count       N/A                N/A        N/A
Read Sectors TOT Count        N/A                N/A        N/A
Initial Bad Block Count       N/A                N/A        N/A
 
Example from log file: /var/run/log/vsandevicemonitord.log:
WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
WARNING - SMART health status for disk naa.xxxxx is IMPENDING FAILURE.

 
Troubleshooting steps:
  • Check if the failed disk is facing any hardware or medium errors (reference the above steps).
  • Run esxtop on the host with the failed disk, press "u" to switch to the disk device view, and check the "DAVG" column for the failed disk to see whether high latency is reported on that disk; if high latency is seen, engage the hardware vendor (a short sketch follows this list). For more information on how to check DAVG using esxtop, reference KB Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions) (1008205).
  • Check the compatibility of the controller driver and firmware, and also check whether the disk is a vSAN supported device and whether its firmware version is supported (reference the vSAN HCL).
  • If there are no compatibility issues, engage the hardware vendor to check for any firmware issues on the controller or disks.
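A quick sketch of the DAVG check (interactively, or in batch mode to review the values offline; adjust the sample interval and count as needed):

# Interactive: run esxtop, press "u" for the disk device view, and read the DAVG/cmd column
esxtop

# Batch mode: capture 12 samples at 5-second intervals for offline review
esxtop -b -d 5 -n 12 > /tmp/esxtop-disk.csv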


(5) Read/write commands failing with Aborts/RETRY: H:0x5 & H:0xc

Example from log file: /var/run/log/vmkernel.log:
2022-10-21T02:50:51.069Z cpu0:2098435)ScsiDeviceIO: 3501: Cmd(0x45a203564900) 0x28, cmdId.initiator=0x45223c91a7f0 CmdSN 0xaa97f from world 0 to dev "naa.xxxx" failed H:0x5 D:0x0 P:0x0 Aborted at driver layer. Cmd count Active:2 Queued:0

2022-10-21T04:41:13.494Z cpu0:2098435)ScsiDeviceIO: 3463: Cmd(0x45aa8ffdedc0) 0x28, CmdSN 0x2 from world 2102512 to dev "naa.xxxxx" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.


Troubleshooting steps:

  • Check the compatibility of the controller driver and firmware, and also check whether the disk is a vSAN supported device and whether its firmware version is supported (reference the vSAN HCL). Commands to gather this information are sketched below.
  • If there are no compatibility issues, engage the hardware vendor to check for any firmware issues on the controller or disks.
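The disk model and firmware revision needed for the HCL comparison can be pulled per device with something like the following (a sketch; replace the device identifier, output trimmed to the relevant fields):

esxcli storage core device list -d naa.xxxxxxxx | grep -E "Vendor:|Model:|Revision:"
   Vendor: <disk vendor>
   Model: <disk model>
   Revision: <disk firmware revision>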