Log congestion due to a single failed drive

Products

VMware vSAN

Issue/Introduction

Symptoms:

Multiple VMs may crash due to very high latency / log congestion.
In vSAN Skyline Health, you will observe a single drive failure and log congestion.
In vmkernel.log, you may notice stuck descriptor events:

DOM: DOM2PCPrintDescriptor:1797: [105568173:0x4313fe8f3718] => Stuck descriptor

In vobd.log, you will observe that the affected disk hit transient errors:

2022-05-31T11:42:46.065Z: [vSANCorrelator] 10605891965954us: [vob.vsan.lsom.devicerepair] vSAN device 521a74ce-c980-c16c-ff3d-38a036233daf is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
2022-05-31T11:42:46.065Z: [vSANCorrelator] 10606062774178us: [esx.problem.vob.vsan.lsom.devicerepair] Device 521a74ce-c980-c16c-ff3d-38a036233daf is in offline state and is getting repaired
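To check how often a device has entered repair, the events above can be pulled out of a copy of vobd.log with a short script. The following is a minimal sketch, not a VMware tool: the regular expression is derived only from the two sample lines shown above and may need adjusting for other builds, and the names `REPAIR_EVENT` and `repair_events` are hypothetical.

```python
import re

# Assumption: the event layout matches the two sample vobd.log lines above.
# Both the "vob.vsan.lsom.devicerepair" and the
# "esx.problem.vob.vsan.lsom.devicerepair" variants are accepted.
REPAIR_EVENT = re.compile(
    r"^(?P<ts>\S+): \[vSANCorrelator\] \d+us: "
    r"\[(?:esx\.problem\.)?vob\.vsan\.lsom\.devicerepair\] "
    r".*?\b[Dd]evice (?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def repair_events(lines):
    """Return {device_uuid: [timestamps]} for every devicerepair event found."""
    events = {}
    for line in lines:
        match = REPAIR_EVENT.match(line)
        if match:
            events.setdefault(match.group("uuid"), []).append(match.group("ts"))
    return events


def repair_events_from_file(path):
    """Convenience wrapper for a log file copied off the host (path is an assumption)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return repair_events(handle)
```

A device UUID that appears with many timestamps spread over hours or days suggests the device is stuck in repair rather than recovering.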

vsandevicemonitord.log shows that DDH repeatedly tried to repair and re-mount the disk, but every attempt failed because of sustained errors from the device:

2022-06-03 01:44:16,575 INFO vsandevicemonitord stderr None, stdout b"VsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nVsanUtil::GetVsanDisks: Error occurred 'Failed to open device /vmfs/devices/disks/naa.500a0751281163a2', create disk with null id\nVsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nErrors: \nUnable to mount: Failed to open device /vmfs/devices/disks/naa.500a0751281163a2\n" from command /sbin/localcli vsan storage diskgroup mount -d naa.500a0751281163a2.

2022-06-03 01:44:16,575 INFO vsandevicemonitord Mounting failed on VSAN device naa.500a0751281163a2.
2022-06-03 01:44:16,575 INFO vsandevicemonitord Repair attempt 131 for device 521a74ce-c980-c16c-ff3d-38a036233daf
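A high repair attempt counter, such as the 131 attempts above, indicates that the device is not recovering. As a rough sketch, the highest attempt number per device can be extracted from a copy of vsandevicemonitord.log as follows. The pattern assumes the "Repair attempt N for device UUID" wording shown in the sample line, and `max_repair_attempts` is a hypothetical helper name.

```python
import re

# Assumption: the message wording matches the sample log line above.
REPAIR_ATTEMPT = re.compile(
    r"Repair attempt (?P<count>\d+) for device "
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def max_repair_attempts(lines):
    """Return {device_uuid: highest repair attempt number seen in the log}."""
    attempts = {}
    for line in lines:
        match = REPAIR_ATTEMPT.search(line)
        if match:
            uuid = match.group("uuid")
            count = int(match.group("count"))
            attempts[uuid] = max(attempts.get(uuid, 0), count)
    return attempts
```

The function accepts any iterable of lines, so an open file handle can be passed directly. No particular attempt count is defined here as a threshold; the result is only meant to help compare against the symptoms described in this article.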

vSAN performance graphs may show high congestion.

Environment

VMware vSAN 7.0.x
VMware vSAN 6.7.x

Cause

RELOG did not run on the failed disk, so PLOG entries built up, which caused log congestion and high latency at the VM level.
RELOG is an internal vSAN process that frees up space in the LSOM layer for log reclamation.
RELOG does not run on a device while it is in the repair state, so a device that stays in repair for a long time can cause log build-up.

Resolution

The issue is fixed in vSAN 6.7 U3 P05 and later 6.7 releases, and in vSAN 7.0 U3d and later 7.0 releases.

Workaround:
If a drive reports an "Operational health error" in Skyline Health and the logs match the symptoms described above, follow these steps:

Put the affected host into maintenance mode with the "Ensure accessibility" option.
Remove the faulty disk from the disk group.
Replace the failed drive and add the new drive to the disk group.

Additional Information

This behavior has been reported in ESXi 6.7.x and in 7.0 GA / 7.0 U1.
After the fix is applied, vSAN processes RELOG on a disk that is under repair, which avoids PLOG build-up.

Impact/Risks:

Cluster performance degrades because of log congestion.
These issues have been reported in non-dedup disk groups.