Heavy resync traffic may cause VM IO performance degradation

Products

VMware vSAN

Issue/Introduction

Symptoms:

Several hardware failure modes and vSAN workflows can trigger a resync or repair to ensure VM accessibility.

Typical scenarios and workflows are:

• One or more node or disk failures

• Node or disk evacuation

• VM storage policy reconfiguration

• Cluster rebalancing when disks are more than 80% full

• Upgrade scenarios like disk format upgrade and enabling deduplication and compression

vSAN uses a congestion algorithm that delays resync traffic first, before VM I/O traffic is delayed. However, VM I/O might still be impacted in the following cases:

• If VM I/O is low compared to resync traffic, VM I/O can be starved by the resync traffic and incur delay.
• If both VM I/O and resync traffic are high, the congestion algorithm throttles resync traffic first. This might not be enough to improve destaging at the LSOM layer, at which point the continued build-up of VM I/O can trigger congestion for VM traffic as well, increasing latency in the VMs.

Environment

VMware vSAN 6.2.x
VMware vSAN 6.5.x
VMware vSAN 6.6.x
VMware vSAN 6.0.x
VMware vSAN 5.5.x
VMware vSAN 6.1.x

Resolution

VM I/O performance degradation:

Starting with vSAN 6.7.x, a new esxcli namespace, esxcli vsan resync, is available. It provides more control over resync monitoring and throttling at the host level of a vSAN node without relying on RVC or the UI.

Open an SSH session to the affected ESXi host and run the following commands.

To check the current value:
- esxcli vsan resync throttle get (gets information about vSAN resync throttling)
To change the current value:
- esxcli vsan resync throttle set --level <level> (sets the vSAN resync throttle level in Mbps; required; integer in the range 0-512, where 0 means no throttling)
Example usage:
- esxcli vsan resync throttle set --level <0-512>

Note: These changes apply per host, not per cluster as in previous builds. No reboot is required for the changes to take effect.
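
Because the setting is applied host by host, it can help to validate the level before running the command on each node. The following is a minimal sketch, not an official tool: the function name resync_throttle_cmd is hypothetical, and the 0-512 Mbps range is taken from the option description above. It only prints the command for review; it does not run esxcli itself.

```shell
#!/bin/sh
# Hypothetical helper: check a resync throttle level and print the matching
# esxcli command. The 0-512 range (in Mbps, 0 = no throttling) is taken from
# the option description in this article.
resync_throttle_cmd() {
    level="$1"
    case "$level" in
        # Reject empty input, non-digit characters, and anything longer
        # than three digits.
        ''|*[!0-9]*|????*)
            echo "level must be an integer in the range 0-512" >&2
            return 1
            ;;
    esac
    if [ "$level" -gt 512 ]; then
        echo "level must be an integer in the range 0-512" >&2
        return 1
    fi
    echo "esxcli vsan resync throttle set --level $level"
}

# Print the command for a 100 Mbps throttle so it can be reviewed first.
resync_throttle_cmd 100
```

After applying a new level on a host, running esxcli vsan resync throttle get on that host confirms the value in effect.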

If the resync process is extremely slow, it is possible that bandwidth for resync traffic is being reduced due to resync throttling or heavy VM I/O on the system. Resync speed can be increased by reducing VM I/O and tuning throttling appropriately to balance VM I/O and Resync traffic. The other primary cause of slowness during resync operation is disk bottlenecking.

If a resync operation is causing a performance impact on the VMs in the cluster and throttling is disabled (as it is by default), the next step is to collect a performance data sample with Verbose and Network Diags enabled via the vSAN Performance Service (for versions 6.7 and higher), and analyze the data to determine where the throughput bottleneck or latency is being introduced. More information on this process can be found in the documentation below.

How to use and interpret performance statistics collected using vSAN Observer (2064240)
Configure vSAN Performance Service

Additional Information

vSAN esxcli Commands

Impact/Risks:
If resync operations are allowed to consume too much bandwidth, some environments may experience congestion, depending on the current workload and the objects that need to be reprotected or rebuilt.