Log congestion due to a single failed drive

Article ID: 326809

Products

VMware vSAN

Issue/Introduction

Symptoms:
  • Multiple VMs may crash due to very high latency and log congestion.
  • vSAN Skyline Health reports a single drive failure and log congestion.
  • In vmkernel.log, you may notice stuck descriptor events:
DOM: DOM2PCPrintDescriptor:1797: [105568173:0x4313fe8f3718] => Stuck descriptor
  • In vobd.log, the affected disk reports transient errors:
2022-05-31T11:42:46.065Z: [vSANCorrelator] 10605891965954us: [vob.vsan.lsom.devicerepair] vSAN device 521a74ce-c980-c16c-ff3d-38a036233daf is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
2022-05-31T11:42:46.065Z: [vSANCorrelator] 10606062774178us: [esx.problem.vob.vsan.lsom.devicerepair] Device 521a74ce-c980-c16c-ff3d-38a036233daf is in offline state and is getting repaired
  • vsandevicemonitord.log shows that DDH (Degraded Device Handling) repeatedly tried to repair and remount the disk, but every attempt failed because of sustained errors from the device:
2022-06-03 01:44:16,575 INFO vsandevicemonitord stderr None, stdout b"VsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nVsanUtil::GetVsanDisks: Error occurred 'Failed to open device /vmfs/devices/disks/naa.500a0751281163a2', create disk with null id\nVsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nErrors: \nUnable to mount: Failed to open device /vmfs/devices/disks/naa.500a0751281163a2\n" from command /sbin/localcli vsan storage diskgroup mount -d naa.500a0751281163a2.

2022-06-03 01:44:16,575 INFO vsandevicemonitord Mounting failed on VSAN device naa.500a0751281163a2.
2022-06-03 01:44:16,575 INFO vsandevicemonitord Repair attempt 131 for device 521a74ce-c980-c16c-ff3d-38a036233daf
 
  • vSAN performance graphs may show high congestion.
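
The log signatures above can be checked together with a short script. The following is a minimal sketch, not a supported tool: the function name `scan_logs` and the shape of its result are illustrative assumptions, and the patterns are derived only from the log excerpts shown in this article, so they may need adjusting for other builds.

```python
import re

# Patterns derived from the log excerpts in this article (assumed stable across builds).
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
STUCK_DESCRIPTOR = re.compile(r"DOM2PCPrintDescriptor.*Stuck descriptor")
DEVICE_REPAIR = re.compile(r"vob\.vsan\.lsom\.devicerepair\].*?(" + UUID + r")")
REPAIR_ATTEMPT = re.compile(r"Repair attempt (\d+) for device (" + UUID + r")")


def scan_logs(lines):
    """Summarize the symptom signatures found in an iterable of log lines."""
    summary = {
        "stuck_descriptors": 0,        # count of vmkernel.log stuck descriptor events
        "devices_in_repair": set(),    # device UUIDs from vobd.log devicerepair events
        "max_repair_attempt": {},      # device UUID -> highest repair attempt number seen
    }
    for line in lines:
        if STUCK_DESCRIPTOR.search(line):
            summary["stuck_descriptors"] += 1
        match = DEVICE_REPAIR.search(line)
        if match:
            summary["devices_in_repair"].add(match.group(1))
        match = REPAIR_ATTEMPT.search(line)
        if match:
            attempt, device = int(match.group(1)), match.group(2)
            previous = summary["max_repair_attempt"].get(device, 0)
            summary["max_repair_attempt"][device] = max(previous, attempt)
    return summary


# Sample lines taken from the excerpts above.
SAMPLE = [
    "DOM: DOM2PCPrintDescriptor:1797: [105568173:0x4313fe8f3718] => Stuck descriptor",
    "2022-05-31T11:42:46.065Z: [vSANCorrelator] 10606062774178us: "
    "[esx.problem.vob.vsan.lsom.devicerepair] Device "
    "521a74ce-c980-c16c-ff3d-38a036233daf is in offline state and is getting repaired",
    "2022-06-03 01:44:16,575 INFO vsandevicemonitord Repair attempt 131 "
    "for device 521a74ce-c980-c16c-ff3d-38a036233daf",
]

if __name__ == "__main__":
    print(scan_logs(SAMPLE))
```

A device that appears in `devices_in_repair` and also shows a high, still-growing repair attempt count is a likely candidate for the condition described in this article.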


Environment

VMware vSAN 7.0.x
VMware vSAN 6.7.x

Cause

Because RELOG did not run on the failed disk, PLOG entries built up, which caused congestion and high latency at the VM level.
RELOG is an internal vSAN process that frees space in the LSOM layer by reclaiming log space.
RELOG does not run on a device that remains in the repair state for a long time, which can lead to log build-up.

 

Resolution

  • The issue is fixed in 6.7 U3 P05 and 7.0 U3D, and in later releases of each.


Workaround:
If a drive reports an "Operational health error" in Skyline Health and the symptoms match those described above, follow these steps:
  • Put the affected host into maintenance mode, choosing "ensure object accessibility".
  • Remove the faulty disk from the disk group.
  • Replace the failed drive and add the new drive to the disk group.
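
For administrators who prefer the command line, the first two steps can be sketched as esxcli invocations. The helper below is a hypothetical dry-run: the function name `workaround_commands` is an assumption, it only builds the command lines and does not execute anything, and the options should be verified with `esxcli ... --help` on your ESXi build before use. It also assumes the faulty device is a capacity disk; removing a cache device removes the whole disk group.

```python
def workaround_commands(device_name):
    """Return esxcli command lines for the first two workaround steps (not executed)."""
    if not device_name:
        raise ValueError("device_name must not be empty")
    return [
        # Step 1: enter maintenance mode with the "ensure object accessibility" vSAN mode.
        ["esxcli", "system", "maintenanceMode", "set",
         "--enable=true", "--vsanmode=ensureObjectAccessibility"],
        # Step 2: remove the faulty capacity disk from its disk group.
        ["esxcli", "vsan", "storage", "remove", "--disk=" + device_name],
    ]


if __name__ == "__main__":
    # Device name taken from the vsandevicemonitord.log excerpt in this article.
    for command in workaround_commands("naa.500a0751281163a2"):
        print(" ".join(command))
```

Printing the commands first makes it easy to review the device name before running anything on the host. The physical replacement and re-adding of the new drive are left out because they depend on the hardware and the disk group layout.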


Additional Information

  • This behavior has been reported in ESXi 6.7.x, 7.0 GA, and 7.0 U1.
  • After the fix is applied, vSAN runs RELOG on a disk that is under repair, which prevents PLOG build-up.


Impact/Risks:
  • Cluster-wide performance degradation due to log congestion.
  • These issues have been reported in non-dedup disk groups.