vSAN "Proactive rebalance" and "Automatic Rebalance"

Article ID: 319926

Products

VMware vSAN

Issue/Introduction

This article explains vSAN's Proactive Rebalance and Automatic Rebalance features and when each one applies.

If the Skyline health check reports that the cluster is imbalanced, with some disks high on space usage while others are very low, you may need to run a proactive rebalance or enable automatic rebalance, depending on the vSAN version, to distribute space usage more evenly across the disks.
 

vSAN offers two basic forms of rebalancing:

  • Reactive Rebalancing. This occurs when vSAN detects any storage device at or above 80% capacity utilization. vSAN then attempts to move some of the data to other devices that fall below this threshold. A more appropriate name for this might be “Capacity Constrained Rebalancing.” This feature has always been an automated, non-adjustable capability.
    • Note: If all disks are more than 80% utilized, the Reactive Rebalance will not run.
  • Proactive Rebalancing. This occurs when vSAN detects that any storage device is consuming a disproportionate amount of its capacity in comparison to other devices. By default, vSAN looks for any device whose capacity usage is 30% or more above that of any other device. A more suitable name for this might be “Capacity Symmetry Rebalancing.”
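
The two triggers described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not vSAN's actual logic: the function name `classify_rebalance` and the input (a list of per-disk used-capacity percentages) are assumptions made for the example, while the 80% and 30% figures come from the descriptions above.

```python
REACTIVE_THRESHOLD = 80.0  # percent used, fixed per the description above
DEFAULT_VARIANCE = 30.0    # percent delta between disks, the stated default


def classify_rebalance(disk_usage_pct, variance_threshold=DEFAULT_VARIANCE):
    """Return which rebalance types the given disk usage would call for.

    disk_usage_pct: hypothetical list of used-capacity percentages, one per disk.
    """
    if not disk_usage_pct:
        return []
    triggers = []
    over = [u for u in disk_usage_pct if u >= REACTIVE_THRESHOLD]
    under = [u for u in disk_usage_pct if u < REACTIVE_THRESHOLD]
    # Reactive: at least one disk at or above 80%, and somewhere below 80% to move data to.
    if over and under:
        triggers.append("reactive")
    # Proactive: the fullest disk is at least `variance_threshold` points above the emptiest.
    if max(disk_usage_pct) - min(disk_usage_pct) >= variance_threshold:
        triggers.append("proactive")
    return triggers
```

For example, `classify_rebalance([85, 90, 82])` returns an empty list: every disk is above 80%, so there is no target for a reactive rebalance, and the spread between disks is only 8 points.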

Proactive Rebalance: A rebalance of the objects in a vSAN cluster that is initiated manually, either through the vSAN Health plugin in the vCenter GUI or through the RVC console. This is only supported in vSAN 6.7 U2 and earlier.

Automatic Rebalance: From vSAN 6.7 U3 onward, manually triggering a proactive rebalance is not required. You can automate all rebalancing activities with cluster-wide configuration and threshold settings.


Symptoms:
The nature of a distributed storage system means that data will be spread across participating nodes. vSAN manages all of this for you. Its cluster-level object manager is not only responsible for the initial placement of data, but ongoing adjustments to ensure that the data continues to adhere to the prescribed storage policy.

Environment

VMware vSAN 8.0.x
VMware vSAN 7.0.x
VMware vSAN 6.x

Cause

Data can become imbalanced for many reasons: Storage policy changes, host or disk group evacuations, adding hosts, object repairs, or overall data growth.

Resolution

Proactive Rebalance: Running a manual rebalance may be necessary when your vSAN cluster is imbalanced. This operation moves components from over-utilized disks to under-utilized disks. A manual rebalance runs for 24 hours and then stops.

Note: A manual rebalance consumes some system resources and can take several hours to complete, depending on the number of objects that need to be rebalanced to reduce the disk usage variance across the cluster.
Run a proactive rebalance when the workload is minimal; use the vSAN performance charts to identify such a period.

To run a Proactive Rebalance in vSAN 6.7 U2 and earlier:
  1. Navigate to the vSAN cluster in the vSphere Web Client.
  2. Click the Monitor tab and click vSAN.
  3. Click Health.
  4. In the vSAN health service table, select Warning: Virtual SAN Disk Balance. You can review the disk balance of the hosts.
  5. Click the Rebalance Disks button to rebalance your cluster.
    Note: This task may take many hours.
--------------------------------------------------------------------------------------------------------------------

Automatic Rebalance:
Starting in vSAN 6.7 U3, disk rebalancing is no longer a manual operation. Instead, it must be enabled as a service in the vSAN cluster settings (explained below). If it is not enabled, vSAN only initiates a rebalance when any vSAN disk crosses the 80% capacity threshold.

Note: Disk rebalancing can impact the I/O performance of your vSAN cluster. To avoid this performance impact, you can alter the threshold value or turn off automatic rebalance when peak performance is required.

Procedure to configure Automatic Rebalance:
1. Navigate to the vSAN cluster.
2. Click the Configure tab.
3. Under vSAN, select Services.
4. Click to edit Advanced Options.
5. Click to enable or disable Automatic Rebalance.
6. Set the variance threshold to any percentage from 20 to 75 as per your requirement.
 
The threshold to initiate rebalance is set to 30% by default, which means that if any two disks have this variance (one is 30% more loaded than the other), rebalancing of components begins. Rebalancing will continue until the variance reaches half of the set threshold value, i.e., 15% by default (or until Automatic Rebalance is disabled).
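
The start and stop behavior described above amounts to a simple hysteresis, which the following Python sketch illustrates. It is not vSAN's implementation: the function name `rebalance_should_run` is hypothetical, and "variance" is modeled here as the difference between the fullest and emptiest disk, which is an assumption based on the wording above. The 20 to 75 range and the 30% default come from the procedure.

```python
MIN_THRESHOLD, MAX_THRESHOLD = 20.0, 75.0  # allowed range per the procedure above


def rebalance_should_run(currently_running, disk_usage_pct, threshold=30.0):
    """Decide whether rebalancing should be active for the given disk usage.

    currently_running: whether a rebalance is already in progress.
    disk_usage_pct: hypothetical list of used-capacity percentages, one per disk.
    """
    if not MIN_THRESHOLD <= threshold <= MAX_THRESHOLD:
        raise ValueError("threshold must be between 20 and 75")
    spread = max(disk_usage_pct) - min(disk_usage_pct)
    if currently_running:
        # Keep going until the spread has dropped to half of the threshold.
        return spread > threshold / 2
    # Start only once the spread reaches the full threshold.
    return spread >= threshold
```

With the default threshold, a cluster whose disks differ by 20 points does not start a rebalance, but a rebalance that is already running continues at that spread because it has not yet reached 15 points.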

There is also a health check for vSAN Disk Balance, where you can see disk usage details of the vSAN cluster. If Automatic Rebalance is enabled, vSAN automatically tries to keep this health check green. If it is disabled, this health check is triggered and requires the admin to manually trigger a Rebalance Disks task or re-enable Automatic Rebalance.

The toggle for enabling or disabling this cluster-level feature can be found in vCenter, under Configure > vSAN > Services > Advanced options > “Automatic Rebalance” as shown in Figure 1.
[Figure 1: The “Automatic Rebalance” setting under Advanced options]

RECOMMENDATION: Keep the “Rebalancing Threshold %” entry to the default value of 30. Decreasing this value could increase the amount of resynchronization traffic and cause unnecessary rebalancing for no functional benefit.

The “vSAN Disk Balance” health check was also changed to accommodate this new capability. If vSAN detects an imbalance that meets or exceeds a threshold while automatic rebalance is disabled, it will provide the ability to enable the automatic rebalancing, as shown in Figure 2. The less-sophisticated manual rebalance operation is no longer available.

[Figure 2: The “vSAN Disk Balance” health check offering to enable automatic rebalance]

Once the Automatic Rebalance feature is enabled, the health check alarm for this balancing will no longer trigger and rebalance activity will occur automatically.
 

Should Automatic Rebalancing Be Enabled?

Yes, it is recommended to enable the automatic rebalancing feature on your vSAN clusters. When the feature was added in 6.7 U3, VMware wanted to introduce the capability slowly to customer environments, so it was not enabled by default, and it remains this way in vSAN 7. With the optimizations made to our scheduler and resynchronizations in recent editions, the feature will likely end up enabled by default at some point.

There may be a few rare cases in which one might want to temporarily disable automatic rebalancing on the cluster. Adding a large number of additional hosts to an existing cluster in a short amount of time might be one of those possibilities, as well as perhaps nested lab environments that are used for basic testing. In most cases, automatic rebalancing should be enabled.

Links
Document: https://docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.virtualsan.doc/GUID-968C05CA-FE2C-45F7-A011-51F5B53BCBF9.html
Document:  https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.vsan-monitoring.doc/GUID-968C05CA-FE2C-45F7-A011-51F5B53BCBF9.html
Release Notes: https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vmware-vsan-67u3-release-notes.html
 

 

 


Workaround:
If automatic rebalance is not working, vSAN cannot find a way to rebalance the objects without filling up other disks. Keep in mind that rebalancing is not only about moving data; vSAN must also ensure that the objects remain accessible.

Consider the following options:
  • Clean up space: look for VMs that are not in use or large files that can be deleted.
  • Add more capacity disks to the hosts.
  • Look for VMs that are not critical or are used for testing and change their policy to RAID 0.
  • Check whether some VMs with the default policy can be changed to RAID 5 or RAID 6 erasure coding.
  • Enable deduplication and compression on the cluster to eliminate duplicate data.
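
To gauge how much raw capacity a policy change might free, a rough estimate can be sketched in Python. The multipliers below are assumptions, not taken from this article: they are the figures commonly cited for the vSAN Original Storage Architecture (RAID 1 with one failure to tolerate stores two copies, RAID 5 uses 3 data plus 1 parity, RAID 6 uses 4 data plus 2 parity) and may differ on other architectures or configurations. The function names are hypothetical.

```python
# Raw-to-usable ratio as (numerator, denominator); assumed OSA figures, see lead-in.
POLICY_OVERHEAD = {
    "RAID-0": (1, 1),  # no redundancy
    "RAID-1": (2, 1),  # two full copies
    "RAID-5": (4, 3),  # 3 data + 1 parity
    "RAID-6": (6, 4),  # 4 data + 2 parity
}


def raw_capacity_gb(usable_gb, policy):
    """Estimate raw capacity consumed by `usable_gb` of VM data under a policy."""
    numerator, denominator = POLICY_OVERHEAD[policy]
    return usable_gb * numerator / denominator


def estimated_savings_gb(usable_gb, old_policy, new_policy):
    """Estimate raw capacity freed by moving data from one policy to another."""
    return raw_capacity_gb(usable_gb, old_policy) - raw_capacity_gb(usable_gb, new_policy)
```

Under these assumptions, moving 300 GB of VM data from RAID 1 to RAID 5 would reduce raw consumption from about 600 GB to about 400 GB. The policy change itself triggers resynchronization and temporarily needs additional free space, so treat the result as an estimate only.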


Additional Information

To run a Proactive Rebalance using RVC (deprecated) in vSAN 6.7 U2 and earlier:
  1. Log into the Ruby vSphere Console (RVC).
  2. Change to the computers namespace.
  3. To see how much data needs to be rebalanced, run this command on your vSAN cluster:
    vsan.proactive_rebalance_info <vSAN-cluster-number, or "." for current rvc path location>
The output looks similar to this:
/localhost/Test-DC/computers/Test-CL> vsan.proactive_rebalance_info .
2019-08-16 19:31:08 +0000: Retrieving proactive rebalance information from host esxi-3.labs.org ...
2019-08-16 19:31:08 +0000: Retrieving proactive rebalance information from host esxi-1.labs.org ...
2019-08-16 19:31:08 +0000: Retrieving proactive rebalance information from host esxi-2.labs.org ...
2019-08-16 19:31:09 +0000: Fetching vSAN disk info from esxi-3.labs.org (may take a moment) ...
2019-08-16 19:31:09 +0000: Fetching vSAN disk info from esxi-2.labs.org (may take a moment) ...
2019-08-16 19:31:09 +0000: Fetching vSAN disk info from esxi-1.labs.org (may take a moment) ...
2019-08-16 19:31:10 +0000: Done fetching vSAN disk infos

Proactive rebalance start: 2019-08-16 19:30:47 UTC
Proactive rebalance stop: 2019-08-17 19:30:54 UTC
Max usage difference triggering rebalancing: 30.00%
Average disk usage: 56.00%
Maximum disk usage: 63.00% (17.00% above minimum disk usage)
Imbalance index: 10.00%

No disk detected to be rebalanced

Note that the start and stop times in this output are 24 hours apart.
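
The length of the rebalance window can be read from the start and stop lines of the output above. The following Python sketch shows one way to compute it; the function name `rebalance_window` is hypothetical, and the parsing assumes the line format shown in the sample output, with both timestamps in the same time zone.

```python
from datetime import datetime


def rebalance_window(info_text):
    """Return the time between the 'start' and 'stop' lines of the info output."""
    stamps = {}
    for line in info_text.splitlines():
        line = line.strip()
        for key in ("start", "stop"):
            prefix = f"Proactive rebalance {key}:"
            if line.startswith(prefix):
                # Keep 'YYYY-MM-DD HH:MM:SS' (19 characters) and drop the zone name.
                value = line[len(prefix):].strip()[:19]
                stamps[key] = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    if "start" not in stamps or "stop" not in stamps:
        raise ValueError("start/stop lines not found in output")
    return stamps["stop"] - stamps["start"]


sample = """Proactive rebalance start: 2019-08-16 19:30:47 UTC
Proactive rebalance stop: 2019-08-17 19:30:54 UTC"""
window = rebalance_window(sample)
print(window)  # 1 day, 0:00:07
```
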
  • To start the rebalance, run this command:
    vsan.proactive_rebalance -s <vSAN-cluster-number>
The output will look like this:
/localhost/Test-DC/computers/Test-CL> vsan.proactive_rebalance . -s
2019-08-16 19:30:55 +0000: Processing vSAN proactive rebalance on host esxi-3.labs.org ...
2019-08-16 19:30:55 +0000: Processing vSAN proactive rebalance on host esxi-1.labs.org ...
2019-08-16 19:30:55 +0000: Processing vSAN proactive rebalance on host esxi-2.labs.org ...

Proactive rebalance has been started!
  • Monitor the status of the rebalance using this command:
    vsan.proactive_rebalance_info <vSAN-cluster-number>

    Note: This task may take many hours.
To run a rebalance beyond the default 24 hours, change the time span of the rebalance with the -t option. The value is in seconds.

For example, setting the rebalance to run for a week:
vsan.proactive_rebalance . -s -t 604800

In this case, the operation runs until the rebalance completes or a week has passed, whichever comes first.
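
The time-span value can be derived with a small Python helper rather than calculated by hand. This is a convenience sketch only: the function names are hypothetical, and the generated command includes -s because the help text for this command states that -t is only valid when the start option is specified.

```python
def time_span_seconds(days=0, hours=0):
    """Convert a duration to the whole number of seconds expected by -t."""
    return int(days * 86400 + hours * 3600)


def build_rebalance_command(cluster_path=".", days=0, hours=0):
    """Build the RVC command line that starts a rebalance for the given duration."""
    seconds = time_span_seconds(days=days, hours=hours)
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return f"vsan.proactive_rebalance {cluster_path} -s -t {seconds}"


print(build_rebalance_command(days=7))  # vsan.proactive_rebalance . -s -t 604800
```
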
 
In vSAN 6.5, the following flags are available for vsan.proactive_rebalance in the RVC console:

vsan.proactive_rebalance . -h
usage: proactive_rebalance [opts] cluster
Configure proactive rebalance for vSAN
  cluster: Path to ClusterComputeResource
  -s, --start                     Start proactive rebalance
  -t, --time-span=<i>             Determine how long this proactive rebalance lasts in seconds, only be valid when option 'start' is specified
  -v, --variance-threshold=<f>    Configure the threshold, that only if disk's used_capacity/disk_capacity exceeds this threshold(comparing to the disk with the least fullness in the cluster), disk is qualified for proactive rebalance, only be valid when option 'start' is specified
  -i, --time-threshold=<i>        Threshold in seconds, that only when variance threshold continuously exceeds this threshold, corresponding disk will be involved to proactive rebalance, only be valid when option 'start' is specified
  -r, --rate-threshold=<i>        Determine how many data in MB could be moved per hour for each node, only be valid when option 'start' is specified
  -o, --stop                      Stop proactive rebalance
  -h, --help                      Show this message

-----------------------------------------------------------------------

Impact/Risks:
vSAN’s built-in logic is designed to take a conservative approach to rebalancing. It avoids moving data unnecessarily, because doing so would consume resources during resynchronization and might yield no material improvement. Like DRS in vSphere, the goal of vSAN’s rebalancing is not to strive for perfect symmetry of capacity or load across hosts, but to adjust data placement to reduce the potential for resource contention. Balanced data placement improves performance because it lowers the likelihood of such contention.