vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster

Article ID: 318908

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

When an ESXi Host that is a member of a DRS Cluster is put into Maintenance Mode, DRS can automatically migrate its Virtual Machines to other compatible hosts in the Cluster. If an ESXi Host with vGPU Virtual Machines is put into Maintenance Mode, the “Enter maintenance mode” task does not complete and reports the following failure event:
 
“DRS failed to generate a vMotion recommendation for a virtual machine on a host entering Maintenance Mode.”
 
By default, DRS does not automatically migrate vGPU Virtual Machines when an ESXi Host enters Maintenance Mode, because their long Virtual Machine Stun Times can disrupt the workload. The Virtual Infrastructure Admin must remediate manually by explicitly migrating the ESXi Host’s vGPU Virtual Machines.

For more information about vMotion and Virtual Machine Stun Time see the following documentation:
Using vMotion to Migrate vGPU Virtual Machines
The vMotion Process Under the Hood - VMware vSphere Blog
Virtual Machine Conditions and Limitations for vSphere vMotion


Environment

VMware vSphere 6.7.x
VMware vSphere ESXi 8.0
VMware vSphere ESXi 8.0.2
VMware vSphere ESXi 7.0.3
VMware vSphere 7.0.x
VMware vSphere ESXi 8.0.1

Cause

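vMotion of a vGPU Virtual Machine involves a long Virtual Machine Stun Time, which depends on both network bandwidth and the size of the vGPU profile. To avoid disrupting workloads, DRS does not automate these migrations by default.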

Resolution

Starting with vSphere 8.0 U2, DRS can estimate the Stun Time for a given vGPU VM configuration. When the DRS Cluster Advanced Options are set and the Estimated VM Devices Stun Time for a VM is lower than the VM Devices vMotion Stun Time limit, DRS will automate VM migrations.
  
To enable this functionality, make sure your infrastructure meets the following requirements:

* Healthy vSphere Lifecycle Services (Refer to: https://kb.vmware.com/s/article/91891)
* Configuration of the VM's vGPU devices through the vCenter UI only
* Healthy vMotion network (Example: vMotion NICs setup through Cluster QuickStart)

Then add the following DRS Cluster Advanced Option:
 
Option: PassthroughDrsAutomation
Value: 1
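
As a rough illustration, the following Python sketch models the condition under which DRS would automate a vGPU VM migration. The function and parameter names are hypothetical and are not part of any vSphere API; the only facts taken from this article are the PassthroughDrsAutomation option, the comparison of the Estimated Stun Time against the Stun Time limit, and the 100 second default limit.

```python
# Default "vMotion Stun Time Limit" as stated in this article.
DEFAULT_STUN_TIME_LIMIT_SECS = 100


def drs_would_automate(passthrough_drs_automation: bool,
                       estimated_stun_secs: float,
                       stun_time_limit_secs: float = DEFAULT_STUN_TIME_LIMIT_SECS) -> bool:
    """Hypothetical model of the rule described in this article.

    DRS automates the migration only when the PassthroughDrsAutomation
    option is enabled AND the Estimated Stun Time is lower than the limit.
    """
    return passthrough_drs_automation and estimated_stun_secs < stun_time_limit_secs
```

With these assumptions, a VM with an estimate of 40 seconds would be migrated automatically once the option is set, while a VM with an estimate of 120 seconds would not be unless the limit is raised.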

For vGPU VMs with Stun Times exceeding the "vMotion Stun Time Limit" (default 100 seconds), a VI Admin can add the following DRS Cluster Advanced Option:

Option: VmDevicesStunTimeTolerated
Value: <number of seconds, greater than any VM's Estimated Stun Time in the Cluster> (Default 100 seconds)

OR

Modify the "vMotion Stun Time Limit" in the VM's Configuration -> "VM Options" Tab -> "Advanced" Section
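
Choosing a Cluster-wide value for VmDevicesStunTimeTolerated amounts to finding a number of seconds greater than every Estimated Stun Time in the Cluster. A minimal sketch, assuming the estimates have already been collected into a list by whatever means is available; the helper name, the optional safety margin, and the choice never to go below the 100 second default are assumptions made for illustration:

```python
import math
from typing import Iterable


def suggest_stun_time_tolerated(estimated_stun_secs: Iterable[float],
                                margin_secs: int = 0,
                                default_secs: int = 100) -> int:
    """Return a whole number of seconds greater than every estimate.

    Hypothetical helper: the result never drops below the 100 second
    default, and margin_secs adds optional headroom on top.
    """
    estimates = list(estimated_stun_secs)
    if not estimates:
        # Nothing to accommodate, so keep the default.
        return default_secs
    # Smallest integer strictly greater than the largest estimate.
    smallest_greater = math.floor(max(estimates)) + 1
    return max(default_secs, smallest_greater + margin_secs)
```

For example, estimates of 40 and 135.5 seconds would yield 136, which could then be entered as the option's Value.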

If needed, the Workaround below will allow evacuation even during vMotion network health degradation.


Workaround:

With vCenter Server 7.0 Update 3f and vSphere 7.0.3 or newer, a DRS Cluster Advanced Options override was added to give Virtual Infrastructure Admins a way to opt in to automated evacuation of vGPU Virtual Machines:

Option: VgpuMMAutomationTimeoutSecs

Value: -1

The above override comes with the following behavior changes:

  • Evacuation of vGPU Virtual Machines is automated, subject to the 100 second vMotion timeout.

  • During Switchover, a vGPU Virtual Machine's Stun Time may exceed 10 seconds (depending on both network bandwidth and the size of the vGPU profile).

  • Evacuation of Virtual Machines is serialized to avoid network contention.

Requirements:

  • Extra vGPU host capacity in the DRS cluster (Example: duplicate host configuration for the host going into Maintenance Mode).

  • No compatibility issues reported for the VMs on the host going into Maintenance Mode.
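
The DRS Cluster Advanced Options named in this article can also be set programmatically rather than through the UI. The following is a minimal sketch, assuming the third-party pyVmomi package is installed and that `cluster` is a `vim.ClusterComputeResource` object already looked up from a connected session. The helper names are hypothetical, and the sketch should be tried in a test environment first.

```python
from typing import Dict, List, Tuple


def drs_option_pairs(options: Dict[str, object]) -> List[Tuple[str, str]]:
    """Normalize option names and values into (key, value) string pairs."""
    return [(str(key), str(value)) for key, value in options.items()]


def set_drs_advanced_options(cluster, options: Dict[str, object]):
    """Apply DRS advanced options to a cluster and return the vCenter task.

    Assumption: `cluster` is a vim.ClusterComputeResource from an existing
    pyVmomi session. The import is local so the pure helper above can be
    used without pyVmomi installed.
    """
    from pyVmomi import vim  # third-party dependency

    option_values = [
        vim.option.OptionValue(key=key, value=value)
        for key, value in drs_option_pairs(options)
    ]
    spec = vim.cluster.ConfigSpecEx(
        drsConfig=vim.cluster.DrsConfigInfo(option=option_values)
    )
    # modify=True applies the spec as an incremental change to the
    # existing cluster configuration instead of replacing it.
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```

A call such as `set_drs_advanced_options(cluster, {"VgpuMMAutomationTimeoutSecs": -1})` would request the Workaround override; the returned task still has to be monitored for completion.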