This article provides information on resolving stuck I/O in a vSAN environment.

Symptoms:
If I/O is stuck or lost on the storage controller or the storage disk, the ESXi storage stack tries to abort it using a task management request and logs console messages such as the following:
2021-06-22T12:02:08.408Z cpu30:1001397101)ScsiDeviceIO: PsaScsiDeviceTimeoutHandlerFn:12834: TaskMgmt op to cancel IO succeeded for device naa.55cd2e404b7736d0 and the IO did not complete. WorldId 0, Cmd 0x28, CmdSN = 0x428.Cancelling of IO will be
2021-06-22T12:02:08.408Z cpu30:1001397101)retried.
If such a lost I/O is found on a host, vSAN takes the disk offline so that it does not affect other hosts in the cluster.
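As a rough illustration, the console message above can be matched with a short Python script to list the devices that reported lost I/O. This is a minimal sketch: the pattern is derived only from the sample message shown here, and the names `STUCK_IO_RE` and `find_stuck_io_devices` are hypothetical. The exact wording may differ between ESXi builds.

```python
import re

# Assumption: the pattern mirrors the sample console message in this article.
# It is not an official or exhaustive definition of the log format.
STUCK_IO_RE = re.compile(
    r"TaskMgmt op to cancel IO succeeded for device (?P<device>\S+) "
    r"and the IO did not complete"
)


def find_stuck_io_devices(lines):
    """Return distinct device identifiers from matching lines, in first-seen order."""
    devices = []
    for line in lines:
        match = STUCK_IO_RE.search(line)
        if match and match.group("device") not in devices:
            devices.append(match.group("device"))
    return devices


# Sample input: the first line of the console message quoted in this article.
sample_lines = [
    "2021-06-22T12:02:08.408Z cpu30:1001397101)ScsiDeviceIO: "
    "PsaScsiDeviceTimeoutHandlerFn:12834: TaskMgmt op to cancel IO succeeded "
    "for device naa.55cd2e404b7736d0 and the IO did not complete. "
    "WorldId 0, Cmd 0x28, CmdSN = 0x428.Cancelling of IO will be",
]

print(find_stuck_io_devices(sample_lines))
```

The function takes any iterable of lines, so it can be fed from an open log file as well as from a list of strings.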
The following alerts are logged in /var/run/log/vobd.log:
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827057us: [vob.vsan.lsom.stuckiooffline] vSAN device 5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. Marking the device as offline.
2021-06-22T12:04:04.237Z: [vSANCorrelator] 19607829404us: [esx.problem.vob.vsan.lsom.stuckiooffline] vSAN device 5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. Marking the device as offline.
If the cache device in a non-dedup disk group encounters stuck I/O, or if any disk in a dedup disk group encounters stuck I/O, the entire disk group is set to the offline state.
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827040us: [vob.vsan.lsom.stuckiopropagated] vSAN device 52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O error. Marking the device as offline.
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607828405us: [esx.problem.vob.vsan.lsom.stuckiopropagated] vSAN device 52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O error. Marking the device as offline.
In vCenter you'll see
The following health alert is also shown if the cache device in a non-dedup disk group encounters stuck I/O. A similar health alert is shown if any disk in a dedup disk group encounters stuck I/O.
esxcli vsan health cluster get -t 'Operation health'
Operation health red
Checks the operation health of the physical disks for all hosts in the vSAN cluster.
Ask VMware: http://www.vmware.com/esx/support/askvmware/index.php?eventtype=com.vmware.vsan.health.test.physdiskoverall
Disks with issues
Host          Disk                                   Overall health  Metadata health  Operational health  In CMMDS/VSI  Operational State Description  Recommendation                                 UUID
10.158.64.25  Local ATA Disk (naa.55cd2e404b766b2c)  red             red              red                 Yes/Yes       Stuck I/O is detected          Migrate the workload and power cycle the host  52e9c739-e025-c001-eb29-62d02f0df0bc
10.158.64.25  Local ATA Disk (naa.55cd2e404b7733c8)  red             red              red                 Yes/Yes       Stuck I/O is detected          Migrate the workload and power cycle the host  52f4590c-149f-3e04-2e48-26249e39f8e6
10.158.64.25  Local ATA Disk (naa.55cd2e404b7736d0)  red             red              red                 Yes/Yes       Stuck I/O is detected          Migrate the workload and power cycle the host  5296eb1f-c017-68b0-9c97-dea29ae522f8
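To tell which vSAN devices hit stuck I/O directly and which were taken offline by propagation, the vobd.log events quoted earlier can be grouped by event type. The following is a minimal sketch, assuming the log lines follow the format of the samples in this article. The names `VOB_RE` and `stuck_io_events` are hypothetical, and the sample lines stand in for the contents of /var/run/log/vobd.log.

```python
import re
from collections import defaultdict

# Assumption: event names and line layout are taken from the sample vobd.log
# entries in this article; other builds may log them differently.
VOB_RE = re.compile(
    r"\[(?:esx\.problem\.)?vob\.vsan\.lsom\."
    r"(?P<event>stuckiooffline|stuckiopropagated)\] "
    r"vSAN device (?P<uuid>[0-9a-f-]+)"
)


def stuck_io_events(lines):
    """Map each stuck I/O event type to a sorted list of vSAN device UUIDs."""
    events = defaultdict(set)
    for line in lines:
        match = VOB_RE.search(line)
        if match:
            events[match.group("event")].add(match.group("uuid"))
    return {event: sorted(uuids) for event, uuids in events.items()}


# Sample input copied from the vobd.log excerpts in this article.
sample_vobd = [
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827057us: "
    "[vob.vsan.lsom.stuckiooffline] vSAN device "
    "5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. "
    "Marking the device as offline.",
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607828405us: "
    "[esx.problem.vob.vsan.lsom.stuckiopropagated] vSAN device "
    "52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O "
    "error. Marking the device as offline.",
]

for event, uuids in stuck_io_events(sample_vobd).items():
    print(event, uuids)
```

The UUIDs reported this way correspond to the UUID column of the health output, which helps map an offline vSAN device back to its physical disk.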
vSAN Skyline Health in vCenter