Dying Disk Handling (DDH) in vSAN 6.6 (2148358)
The Dying Disk Handling (DDH) feature in vSAN continuously monitors the health of disks and diskgroups in order to detect an impending disk or diskgroup failure, or a poorly performing diskgroup. DDH diagnoses disk/diskgroup health by detecting either excessive I/O latency on a vSAN disk, or maximum log congestion that vSAN determines to be due to log leak issues in a vSAN diskgroup, sustained over an extended period. Unhealthy disks/diskgroups are marked as such and are no longer used for new data placement. If an unhealthy disk belongs to a deduplication-enabled diskgroup, the whole diskgroup is marked as unhealthy.

The action vSAN takes for the data on these disks/diskgroups depends on the configured policy and the compliance state of the objects that have components on them. If a component on the unhealthy disk/diskgroup belongs to an object that can tolerate the failure of that disk/diskgroup, vSAN immediately marks the component as “absent” to avoid any impact on write performance for that object. The object is then in a failure condition and cannot tolerate any additional failures if it is configured with the default policy of failuresToTolerate = 1. Such components are repaired lazily by vSAN (after a 60-minute timeout). If, however, a component on the dying disk/diskgroup is required to maintain the availability or quorum of a vSAN object, evacuation is triggered immediately. vSAN makes a best-effort attempt to evacuate all “active” components from a “dying” disk, but this process may fail if there are not enough resources in the cluster or if the components belong to inaccessible objects.
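The per-component decision described above can be sketched as follows. This is an illustrative simplification only, not vSAN source code; the names (`Component`, `handle_unhealthy_disk`, `object_tolerates_loss`) are hypothetical.

```python
# Illustrative sketch of the DDH decision flow described above.
# NOT vSAN source code; all names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Component:
    object_id: str
    # True if the owning object can stay available and keep quorum
    # without this component (e.g. FTT=1 with no prior failure).
    object_tolerates_loss: bool


def handle_unhealthy_disk(components):
    """Split a dying disk's components into (absent, evacuate_now).

    With deduplication enabled, the whole diskgroup is marked
    unhealthy, so a caller would pass every component in the group.
    """
    absent, evacuate_now = [], []
    for c in components:
        if c.object_tolerates_loss:
            # Marked "absent" immediately; repaired lazily after the
            # standard 60-minute timeout.
            absent.append(c)
        else:
            # Needed for availability/quorum: evacuated immediately
            # (best effort; may fail without sufficient resources).
            evacuate_now.append(c)
    return absent, evacuate_now
```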
When DDH detects that a disk has exceeded the I/O latency threshold during the monitoring interval, vSAN generates a VOB and logs a message to the vsandevicemonitord.log file in the /var/run/log directory. The entry below is an example for a disk that needs to be replaced once the required data evacuation is complete and the disk is in an "evacuated" state:
WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
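A script can pull these entries out of the log for reporting. The sketch below parses the latency warning format quoted above; the regular expression and field names are this article's interpretation of that format, not an official vSAN API.

```python
# Parse the DDH latency warning format quoted above.
# The regex mirrors the documented message shape; field names are
# this sketch's own, not an official vSAN interface.
import re

LATENCY_RE = re.compile(
    r"WARNING - (READ|WRITE) Average Latency on VSAN device (\S+) "
    r"has exceeded threshold value (\d+) us (\d+) times"
)


def parse_latency_warning(line):
    """Return the warning's fields as a dict, or None if no match."""
    m = LATENCY_RE.search(line)
    if not m:
        return None
    op, device, threshold_us, count = m.groups()
    return {
        "op": op,                       # READ or WRITE
        "device": device,               # NAA disk name
        "threshold_us": int(threshold_us),
        "count": int(count),            # intervals over threshold
    }
```

The device names and numeric values in any test input are placeholders; real entries carry the actual NAA identifier and thresholds.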
When DDH detects that a caching-tier device has excessive log congestion during the monitoring interval, vSAN generates a VOB and logs a message to the vsandevicemonitord.log file. Excessive log congestion messages are in this format:
WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
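The congestion format can be parsed the same way. As with the latency parser, the regular expression below is this article's reading of the quoted message shape, and the field names are hypothetical.

```python
# Parse the DDH log congestion warning format quoted above.
# The "current/required" pair counts monitoring intervals with
# excessive congestion versus the intervals needed to mark the
# device unhealthy.
import re

CONGESTION_RE = re.compile(
    r"WARNING - Maximum log congestion on VSAN device (\S+) (\d+)/(\d+)"
)


def parse_congestion_warning(line):
    """Return the warning's fields as a dict, or None if no match."""
    m = CONGESTION_RE.search(line)
    if not m:
        return None
    device, current, required = m.groups()
    return {
        "device": device,
        "intervals": int(current),
        "required": int(required),
        "unhealthy": int(current) >= int(required),
    }
```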
In both of these situations, vSAN triggers the evacuation of some or all data from the affected disks/diskgroups. The “overall disks health” section in the vSAN health monitoring UI reports one of the following operational states for the affected disks/diskgroups, along with recommendations for the user. The recommendations shown after the evacuation is complete differ depending on whether vSAN detected excessive I/O latencies or excessive log congestion.
This is a list of the failures the vSAN Health monitoring UI will report:
- Impending permanent disk failure, data is being evacuated (Health state - Yellow)
vSAN is evacuating the required data from this disk due to an impending permanent disk failure. As long as sufficient resources are available in the rest of the vSAN cluster, vSAN will successfully evacuate all “healthy” components from the “dying” disk, though this may increase overall datastore usage. If the cause is excessive I/O latencies, plan for disk/diskgroup replacement, and wait for the evacuation to complete before removing the “dying” disk from the vSAN cluster. Alternatively, if the cause is high log congestion, prepare for a temporary increase in cluster usage as a result of the evacuation (assuming there is enough space in the cluster to host the evacuated data); in this case the diskgroup should be removed from vSAN and then re-added.
- Impending permanent disk failure, data evacuation failed due to insufficient resources (Health state - Red)
Evacuation failed due to insufficient resources in the cluster. Add the required capacity to the fault domain containing the dying disk; evacuation will proceed automatically once the additional resources are added. The active data on the affected disks remains usable.
- Impending permanent disk failure, data evacuation failed due to inaccessible objects (Health state - Red)
vSAN has evacuated everything that could be evacuated; all remaining components on this disk belong to objects that were inaccessible for reasons other than this DDH workflow. Examine the remaining data on the disks to decide whether it is useful and needs to be recovered with help from VMware GSS, or whether it can be purged. Many inaccessible-object issues are caused by inaccessible swap objects. For more information and help removing inaccessible objects, refer to the On-disk upgrade concerns – inaccessible swap objects section in the VMware Virtual SAN Diagnostics and Troubleshooting Reference Manual and vSAN Health Service - Data Health – vSAN Object Health (2108319).
Once all the inaccessible objects have been purged or recovered, DDH evacuation proceeds automatically and the disk transitions either into the “data evacuation completed” state or, if sufficient resources are unavailable, the “data evacuation failed” state. If there is a delay in resolving this situation, the only risk is losing any useful or important data remaining on the disk, especially since the disk could fail permanently at any time. The presence of this disk/diskgroup in the cluster should have no impact on the performance of the rest of the cluster, since no accessible VM has any data left on the unhealthy disk.
- Impending permanent disk failure, data evacuation completed (Health state - Yellow)
This is the disk state when all components required to maintain object accessibility have been evacuated from the disk and the remaining components have been marked "absent" by vSAN. The entries in the vsandevicemonitord.log file indicate whether the disk was marked unhealthy due to excessive log congestion or excessive I/O latencies. If the diskgroup was marked unhealthy due to excessive log congestion, remove it from the vSAN cluster and add it back; it should be in a usable state for vSAN once re-added. If, on the other hand, the disk was marked unhealthy due to excessive I/O latencies, the disk should no longer be used for vSAN.
If a "dying" disk belongs to a deduplication-enabled diskgroup, the whole diskgroup is marked unhealthy. After the required data evacuation, the vsandevicemonitord.log file identifies which disks in the diskgroup were observed to have excessive I/O latencies; only those disks need to be replaced. Collect the output from the vsandevicemonitord.log file, which contains the SMART logging information as well as the high latencies observed by vSAN, and send this information to the disk vendor along with the disk.
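To narrow a deduplication-enabled diskgroup down to the specific disks that showed latency warnings, the log can be summarized per device. The sketch below counts latency warnings per NAA name from a set of log lines; the function name and matching logic are this article's illustration, not a VMware tool.

```python
# Count DDH latency warnings per device from vsandevicemonitord.log
# lines, to identify which disks in an unhealthy diskgroup actually
# showed excessive I/O latency and therefore need replacement.
# Hypothetical helper; not a VMware-provided tool.
import re
from collections import Counter

DEVICE_RE = re.compile(r"on VSAN device (\S+)")


def devices_with_latency_warnings(log_lines):
    """Return Counter mapping device name -> latency warning count."""
    counts = Counter()
    for line in log_lines:
        if "Average Latency" not in line:
            continue  # skip congestion and unrelated entries
        m = DEVICE_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```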