Symptoms:
- Multiple VMs may crash or become unresponsive due to very high latency caused by log congestion.
- In vSAN Skyline Health, you may observe a single drive failure and log congestion.
- In vmkernel.log, you may notice stuck descriptor events:
DOM: DOM2PCPrintDescriptor:1797: [105568173:0x4313fe8f3718] => Stuck descriptor
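To gauge how often the DOM layer is hitting this condition, you can count these events with grep. A minimal sketch; on an ESXi host you would point grep at the vmkernel log (by default under /var/run/log/), but a sample line from the excerpt above is used here so the command is self-contained:

```shell
# Count "Stuck descriptor" events in vmkernel log output.
# On a live host, replace the printf with: cat /var/run/log/vmkernel.log
printf '%s\n' \
  'DOM: DOM2PCPrintDescriptor:1797: [105568173:0x4313fe8f3718] => Stuck descriptor' \
  | grep -c 'Stuck descriptor'
```

A steadily growing count across log rotations indicates the condition is ongoing rather than a one-off event.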
- In vobd.log, you will observe the affected disk hitting transient I/O errors:
2022-05-31T11:42:46.065Z: [vSANCorrelator] 10605891965954us: [vob.vsan.lsom.devicerepair] vSAN device 521a74ce-c980-c16c-ff3d-38a036233daf is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
2022-05-31T11:42:46.065Z: [vSANCorrelator] 10606062774178us: [esx.problem.vob.vsan.lsom.devicerepair] Device 521a74ce-c980-c16c-ff3d-38a036233daf is in offline state and is getting repaired
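The device UUID in these messages identifies the affected vSAN disk, and you can extract it from the devicerepair events with a UUID grep pattern. A sketch run against the sample message above so it works anywhere; on a host you would pipe the vobd log (default location /var/run/log/vobd.log) through the same filter:

```shell
# Extract the vSAN device UUID from a lsom.devicerepair event.
# On a live host: grep 'lsom.devicerepair' /var/run/log/vobd.log | grep -o ...
line='[vob.vsan.lsom.devicerepair] vSAN device 521a74ce-c980-c16c-ff3d-38a036233daf is being repaired due to I/O failures'
printf '%s\n' "$line" \
  | grep -o '[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}'
```

The extracted UUID can then be matched against `esxcli vsan storage list` output to map it to the physical device (naa.* identifier) and its disk group.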
- In vsandevicemonitord.log, you will see that DDH (Degraded Device Handling) repeatedly tried to repair/re-mount the disk, but the attempts failed continuously due to sustained errors from the device:
2022-06-03 01:44:16,575 INFO vsandevicemonitord stderr None, stdout b"VsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nVsanUtil::GetVsanDisks: Error occurred 'Failed to open device /vmfs/devices/disks/naa.500a0751281163a2', create disk with null id\nVsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nErrors: \nUnable to mount: Failed to open device /vmfs/devices/disks/naa.500a0751281163a2\n" from command /sbin/localcli vsan storage diskgroup mount -d naa.500a0751281163a2.
2022-06-03 01:44:16,575 INFO vsandevicemonitord Mounting failed on VSAN device naa.500a0751281163a2.
2022-06-03 01:44:16,575 INFO vsandevicemonitord Repair attempt 131 for device 521a74ce-c980-c16c-ff3d-38a036233daf
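To see how long DDH has been retrying, you can filter the monitor log for repair attempts and look at the latest attempt number. A sketch using sample lines copied from the excerpt so it is runnable as-is; on a host you would grep the vsandevicemonitord log (default location under /var/run/log/) instead:

```shell
# Show the most recent DDH repair attempt for the device.
# On a live host, replace the printf with: cat /var/run/log/vsandevicemonitord.log
printf '%s\n' \
  'INFO vsandevicemonitord Mounting failed on VSAN device naa.500a0751281163a2.' \
  'INFO vsandevicemonitord Repair attempt 131 for device 521a74ce-c980-c16c-ff3d-38a036233daf' \
  | grep 'Repair attempt' | tail -n 1
```

A high and still-incrementing attempt count (131 in this example) shows the repair loop is not succeeding, consistent with sustained errors from the device rather than a transient glitch.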
- vSAN performance graphs may show high congestion.