"vSphere HA virtual machine failed to failover" error in vCenter Server

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

This article provides information to:

Clear the vSphere HA virtual machine failed to failover error from the virtual machine.
Deal with the vSphere HA virtual machine failed to failover error if it occurs.
Reduce the occurrence of the vSphere HA virtual machine failed to failover error.

Symptoms:

In a cluster with the host isolation response set to Leave powered on, a virtual machine may display this error when a host becomes isolated:

vSphere HA virtual machine failed to failover
The virtual machine continues to run without a problem.

Environment

VMware vCenter Server 6.x
VMware vCenter Server 7.x
VMware vCenter Server 8.x
VMware vSphere 6.x
VMware vSphere 7.x
VMware vSphere 8.x

Cause

This behavior can occur whenever a High Availability primary agent declares a host dead. However, the virtual machines continue to run without incident. This alarm does not mean HA has failed or stopped working. When this alarm is triggered, it means that one or more virtual machines failed to get powered on by a host in a cluster protected by HA.

Possible reasons for this to happen:

The host is still running but has disconnected from the network. The cluster's host isolation response is set to Leave powered on:
- When a host becomes network isolated, the remaining hosts in the cluster do not know whether the host has crashed or is just disconnected from the network. As a result, the remaining hosts attempt to power on the virtual machines that were last logged as running on the isolated host. With Leave powered on, the isolated host leaves its virtual machines running and does not attempt to power them down, so it keeps the locks on their files. Because the isolated host holds the file locks, the remaining hosts fail to perform the power-on task on those virtual machines, which triggers the alarm.
The host is still running but has disconnected from the network. The cluster's host isolation response is set to Shut down or Power off:
- With this host isolation response, a host attempts to send shut down or power off commands to its running virtual machines when it recognizes that it is isolated. Once a virtual machine is completely shut down and the isolated host no longer holds locks on the virtual machine's files, the remaining hosts in the cluster can obtain the locks necessary to power on the virtual machine. If the virtual machine is not successfully shut down, or the locks are not released, the alarm is triggered.
The host has failed and the virtual machine storage is in a degraded state. The remaining hosts in the cluster cannot contact the storage device and fail to power up the virtual machines, resulting in the alarm triggering.
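
The three cases above can be summarized as a small decision model. This is an illustrative sketch only, not how vSphere HA is implemented: the `Scenario` fields and the `failover_alarm_triggers` function are hypothetical names chosen for this example, and the model reduces each case to the file-lock and storage conditions described above.

```python
from dataclasses import dataclass

# Hypothetical labels for the isolation responses named in this article.
RESPONSES = {"leave_powered_on", "shut_down", "power_off"}


@dataclass
class Scenario:
    host_crashed: bool        # True if the host actually failed
    isolation_response: str   # one of RESPONSES
    vm_stopped_cleanly: bool  # isolated host stopped the VM and released its locks
    storage_reachable: bool   # remaining hosts can reach the VM's datastore


def failover_alarm_triggers(s: Scenario) -> bool:
    """Return True if a restart attempt by another host would fail."""
    if s.isolation_response not in RESPONSES:
        raise ValueError(f"unknown isolation response: {s.isolation_response}")
    if not s.storage_reachable:
        # Degraded storage: the remaining hosts cannot reach the VM files.
        return True
    if s.host_crashed:
        # No running host holds the locks, so the restart can proceed.
        return False
    if s.isolation_response == "leave_powered_on":
        # The isolated host keeps the VM running and keeps its file locks.
        return True
    # Shut down / Power off: the restart works only if the locks were released.
    return not s.vm_stopped_cleanly
```

For example, `failover_alarm_triggers(Scenario(False, "leave_powered_on", False, True))` evaluates to `True`, which matches the first case: the virtual machine keeps running on the isolated host while the restart attempt elsewhere fails.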

Resolution

This is expected behavior in VMware vCenter Server. Because the virtual machines continue to run without incident, you can safely ignore this issue.

Workaround:
To clear the alarm from the virtual machine:

Acknowledge the alarm in the Monitor tab.
1. Select an inventory object in the object navigator.
2. Click the Monitor tab.
3. Click Issues and Alarms, and click Triggered Alarms.
4. Select an alarm and select Acknowledge.

Note: If this alarm is on multiple virtual machines, you may select the host, cluster, data center, or vCenter Server object in the left pane and continue with step 2 to clear the alarms with fewer steps.
For more information on dealing with alerts, see:

vCenter Server 7.x - the Acknowledge Triggered Alarms in the vSphere Client section in the document vSphere Monitoring and Performance
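
When the alarm is present on many virtual machines, the acknowledgement can also be scripted. The following is a minimal sketch under stated assumptions, not a supported tool: `acknowledge_matching` and `fake_state` are hypothetical helpers, the state objects are assumed to expose `alarm.info.name`, `entity`, and `acknowledged`, and the `acknowledge` callable is supplied by the caller. The sketch runs without a vCenter Server connection by using stand-in objects.

```python
from types import SimpleNamespace


def acknowledge_matching(states, alarm_name, acknowledge):
    """Acknowledge every unacknowledged triggered alarm whose name matches.

    states      -- iterable of objects with .alarm.info.name, .entity, .acknowledged
    alarm_name  -- alarm name exactly as it appears under Triggered Alarms
    acknowledge -- callable(alarm, entity) that performs the acknowledgement
    Returns the list of entities whose alarm was acknowledged.
    """
    done = []
    for state in states:
        # Skip alarms that are already acknowledged or have a different name.
        if state.acknowledged or state.alarm.info.name != alarm_name:
            continue
        acknowledge(state.alarm, state.entity)
        done.append(state.entity)
    return done


def fake_state(name, entity, acknowledged=False):
    """Build a stand-in triggered-alarm state for trying out the helper."""
    alarm = SimpleNamespace(info=SimpleNamespace(name=name))
    return SimpleNamespace(alarm=alarm, entity=entity, acknowledged=acknowledged)
```

If a client library such as pyVmomi is used, the states would presumably come from an inventory object's `triggeredAlarmState` property and the callable would wrap the alarm manager's `AcknowledgeAlarm` method. Treat both as assumptions to verify against the vSphere API reference for your version, and pass the alarm name exactly as your vCenter Server displays it.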

To reduce the likelihood of this issue occurring:

Use multiple management networks. For more information, see Best Practices for VMware vSphere® High Availability Clusters.
Ensure the datastore heartbeats within the vCenter Server are communicating properly for HA to run efficiently when management network problems occur.

For example, if using SAN and IP-based storage, mount a couple of SAN-based datastores to the hosts in the cluster so that HA may use them instead of IP-based storage. Or, if only IP-based storage is used, consider fault isolating one or more of the networks used for storage from those used for the management network.
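
To check which datastores are suitable for heartbeating under the guidance above, the inventory can be reduced to a simple set comparison. This is a rough sketch and not the selection logic vSphere HA itself uses: `heartbeat_candidates`, its parameters, and the default of two datastores (reflecting "a couple" in the example above) are assumptions made for illustration.

```python
def heartbeat_candidates(host_datastores, ip_based, wanted=2):
    """Pick datastores mounted on every host, preferring SAN over IP storage.

    host_datastores -- dict mapping host name to the datastore names it mounts
    ip_based        -- set of datastore names backed by IP-based storage
    wanted          -- how many candidates to return (assumed default: 2)
    """
    if not host_datastores:
        return []
    # Only datastores visible to every host in the cluster are useful here.
    shared = set.intersection(*(set(ds) for ds in host_datastores.values()))
    # SAN-backed datastores sort first (False < True), then by name.
    ranked = sorted(shared, key=lambda ds: (ds in ip_based, ds))
    return ranked[:wanted]
```

If the result contains only IP-based datastores, or fewer entries than requested, that suggests mounting additional SAN-based datastores on all hosts or separating the storage network from the management network, as described above.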