vSAN Health Service - Capacity utilization - Disk space

Article ID: 326889

Products

VMware vSAN

Issue/Introduction

This article explains the Capacity Utilization Health – Disk Space check in the vSAN Health Service and provides details on why it might report an error.

Environment

VMware vSAN 6.7.x
VMware vSAN 7.0.x
VMware vSAN 8.0.x

Resolution

Q: What does the Capacity utilization health – disk space check do?

This health check monitors cluster-level disk usage and verifies that it does not exceed the thresholds defined below.
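
The check itself runs inside the vSAN health service, but the same comparison can be approximated from outside. The following is a minimal sketch using pyVmomi; the vCenter hostname and credentials are placeholders, and the script only reports a usage percentage rather than reproducing the health service's internal thresholds.

  # Minimal sketch (pyVmomi): report how full each vSAN datastore is.
  # The hostname and credentials below are placeholders.
  import ssl
  from pyVim.connect import SmartConnect, Disconnect
  from pyVmomi import vim

  si = SmartConnect(host="vcenter.example.com",
                    user="administrator@vsphere.local",
                    pwd="***",
                    sslContext=ssl._create_unverified_context())
  try:
      content = si.RetrieveContent()
      view = content.viewManager.CreateContainerView(
          content.rootFolder, [vim.Datastore], True)
      for ds in view.view:
          if ds.summary.type != "vsan":
              continue
          used = ds.summary.capacity - ds.summary.freeSpace
          used_pct = 100.0 * used / ds.summary.capacity
          print(f"{ds.summary.name}: {used_pct:.1f}% of capacity in use")
  finally:
      Disconnect(si)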
 
Q: What does it mean when it is in an error state?

For vSAN, if this check reports a warning or error, it may mean there is not enough free storage capacity: the deployment of new virtual machines may fail, and component rebuild operations may not be allowed. If the cluster is running out of space (for example, more than 99% full), running virtual machines may crash with a pending question and any read/write I/O may fail.

For vSAN Direct, if this check reports a warning or error, it may mean there is not enough free storage capacity, and the creation of CNS volumes may fail.

Thresholds definition

On this page, the vSAN ops threshold and the host rebuild threshold are defined as follows:

  • vSAN ops threshold = Total space - Operations reserve
  • Host rebuild threshold = Total space - Operations reserve - Host rebuild reserve

* Note: in some cases the host rebuild threshold can be equal to the vSAN ops threshold, such as in VMC deployments, because the host rebuild reserve feature is not needed there.

The red and yellow thresholds are defined as follows:

When host rebuild reserve is disabled:

  • Red: MIN(90% of total capacity, vSAN ops threshold)
  • Yellow: MIN(70% of total capacity, Host rebuild threshold)

When host rebuild reserve is enabled:

  • Red: MIN(90% of total capacity, Host rebuild threshold)
  • Yellow: MIN(70% of total capacity, 80% of Host rebuild threshold)
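
As a purely illustrative example of the definitions above, the following Python sketch plugs in made-up capacity and reserve figures; the reserve sizes are invented for the example and are not values read from a real cluster.

  # Illustrative only: the capacity and reserve sizes are made-up numbers.
  TB = 1024 ** 4

  total_capacity       = 100 * TB   # example raw vSAN datastore capacity
  operations_reserve   = 5 * TB     # example operations reserve
  host_rebuild_reserve = 10 * TB    # example host rebuild reserve

  vsan_ops_threshold     = total_capacity - operations_reserve
  host_rebuild_threshold = total_capacity - operations_reserve - host_rebuild_reserve

  def thresholds(rebuild_reserve_enabled):
      """Return (red, yellow) thresholds in bytes per the definitions above."""
      if rebuild_reserve_enabled:
          red    = min(0.90 * total_capacity, host_rebuild_threshold)
          yellow = min(0.70 * total_capacity, 0.80 * host_rebuild_threshold)
      else:
          red    = min(0.90 * total_capacity, vsan_ops_threshold)
          yellow = min(0.70 * total_capacity, host_rebuild_threshold)
      return red, yellow

  for enabled in (False, True):
      red, yellow = thresholds(enabled)
      print(f"host rebuild reserve enabled={enabled}: "
            f"yellow at {yellow / TB:.0f} TB, red at {red / TB:.0f} TB")

With these example figures, enabling the host rebuild reserve lowers the red threshold from 90 TB to 85 TB and the yellow threshold from 70 TB to 68 TB.
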
Q: How does one troubleshoot and fix the error state?

The first step is to ensure that all storage is valid, that there are no missing capacity devices, and that the vSAN datastore capacity is what you expect it to be. There are three options to recover from a full vSAN datastore situation, as detailed below:
  1. Download VM(s) to local workstation to free up space on vSAN datastore:
    1. Identify the full disk/diskgroup.
    2. Identify VM(s) residing on the full diskgroup (see the sketch after this list for listing VMs by used space).
      1. The VM must be healthy. If it is orphaned or inaccessible, select a different VM to free up space.
      2. The VM must have namespace reservation usage of less than 10 MB.
    3. Power off the VM [for VMs with a pending question, answer 'Cancel'].
      1. There may be multiple questions - all must be answered.
      2. If hostd is non-responsive/hung, this method may not work.
    4. Download the VM to your local workstation.
      1. Verify the download is complete and contains all VM data.
    5. Delete the VM on vSAN.
  2. Power-off VMs to free up resources on vSAN:
    1. Gracefully power off VM(s). This will delete the swap files associated with those guests and free up some space.
    2. Once enough VMs have been powered off, svMotion VMs to alternate storage.
    3. If alternate storage is not available, delete unnecessary guests to free up additional space.
  3. Expand vSAN Direct datastore capacity.
    If vSAN servers have available slots for additional drives and disks are available, add disks to the vSAN Direct storage to expand capacity. This is typically the easiest option if the appropriate resources are available.
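
To help choose which guests to power off, migrate, or delete (options 1 and 2 above), the following pyVmomi sketch lists the VMs registered against a vSAN datastore, largest consumers of committed space first. The datastore name and the already-connected ServiceInstance (si) are assumptions, and mapping a VM to a specific full diskgroup requires vSAN-specific tooling that is not shown here.

  # Hedged sketch (pyVmomi): list VMs on a vSAN datastore by committed space.
  # Assumes "si" is an already-connected pyVmomi ServiceInstance; the
  # datastore name "vsanDatastore" is a placeholder.
  from pyVmomi import vim

  def vms_by_used_space(si, datastore_name="vsanDatastore"):
      content = si.RetrieveContent()
      view = content.viewManager.CreateContainerView(
          content.rootFolder, [vim.Datastore], True)
      ds = next(d for d in view.view if d.summary.name == datastore_name)
      rows = [(vm.summary.config.name,
               str(vm.runtime.powerState),
               vm.summary.storage.committed) for vm in ds.vm]
      return sorted(rows, key=lambda r: r[2], reverse=True)

  # Example usage:
  # for name, state, used in vms_by_used_space(si):
  #     print(f"{name}  {state}  {used / 1024**3:.1f} GiB committed")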

Caveats

  • vSAN has limited functionality when full.
    • When diskgroups in the cluster are full, VMs that reside on those diskgroups will experience failures for any operations that require space, such as storage vMotion. Other operations may take significantly more time to complete than they would in a healthy cluster.
  • Recovered space may not be reflected in the UI.
    • When pursuing recovery methods above that delete swap files, manually confirm the reclaimed space via the CLI, since the UI may not reflect it immediately.
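
As a scripted alternative to checking on the hosts, the following pyVmomi sketch asks vCenter to refresh a datastore's storage figures and re-reads its free space. It assumes an existing vim.Datastore object for the vSAN datastore and is a convenience check only, not a replacement for verifying via the ESXi CLI.

  # Hedged sketch (pyVmomi): re-query free space after deleting swap files or VMs.
  # Assumes "ds" is an existing vim.Datastore object for the vSAN datastore.
  def confirm_reclaimed_space(ds):
      before = ds.summary.freeSpace
      ds.RefreshDatastoreStorageInfo()   # ask vSphere to refresh capacity figures
      after = ds.summary.freeSpace
      print(f"free space: {before / 1024**3:.1f} GiB -> {after / 1024**3:.1f} GiB")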