Performing a "Reconfigure for VMware HA" operation on the primary ESXi node in an HA cluster triggers false "unexpected virtual machine failover" alerts

Article ID: 318969

Updated On:

Products

VMware vCenter Server

Issue/Introduction

Symptoms:
  • When you perform a Reconfigure for VMware HA operation on the primary node in an HA cluster, an unexpected virtual machine failover alert is triggered for the virtual machines running on that primary node.
  • The vCenter Server events tab displays messages similar to:

    vCenter Server is disconnected from a master HA agent running on host <master hostname> in HA_DRS_Cluster in Datacenter

    vSphere HA agent on <master hostname> in cluster HA_DRS_Cluster in Datacenter is disabled

    The vSphere HA availability state of the host <master hostname> in cluster HA_DRS_Cluster in Datacenter has changed to Uninitialized

    The vSphere HA availability state of the host <slave hostname> in cluster HA_DRS_Cluster in Datacenter has changed to Election

    vSphere HA unsuccessfully failed over <virtual machine> on <slave hostname> in cluster HA_DRS_Cluster in Datacenter. vSphere HA will retry if the maximum number of attempts has not been exceeded. Reason: The operation is not allowed in the current state.


Environment

VMware vCenter Server 5.0.x
VMware vCenter Server 5.1.x
VMware vCenter Server 5.5.x
VMware vCenter Server 6.0.x
VMware vCenter Server 6.5.x
VMware vCenter Server 6.7.x
VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.x

Cause

Manually reconfiguring the primary host for HA causes the remaining secondary hosts to hold an election for a new primary host.

The newly elected primary host places the virtual machines running on the old primary host in an unknown power state and waits up to 10 seconds for notification that the virtual machines on the old primary host are powered on and running.

If the old primary host does not rejoin as a secondary host within that 10-second interval, the new primary host assumes that the virtual machines are down and attempts to restart them. This generates a false failover event, and the failover task fails because the virtual machines were never powered off. The virtual machines themselves are unaffected.
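
The timing condition described above can be sketched in a few lines of Python. This is only an illustrative model, not how the HA agent is implemented: the function name and the `rejoin_delay_s` parameter are hypothetical, and treating a rejoin at exactly the end of the interval as "in time" is an assumption.

```python
DEFAULT_MONITOR_PERIOD_S = 10  # default monitor period stated in this article


def triggers_false_failover(rejoin_delay_s: float,
                            monitor_period_s: float = DEFAULT_MONITOR_PERIOD_S) -> bool:
    """Return True if the new primary would attempt a (false) restart.

    rejoin_delay_s   -- hypothetical time the old primary takes to come
                        back as a secondary host after the reconfigure
    monitor_period_s -- how long the new primary waits before assuming
                        the virtual machines are down
    """
    # The restart attempt happens only when the old primary has NOT
    # rejoined within the monitor period.
    return rejoin_delay_s > monitor_period_s


if __name__ == "__main__":
    for delay in (5, 15, 25):
        for period in (10, 30):
            outcome = "false alert" if triggers_false_failover(delay, period) else "no alert"
            print(f"rejoin after {delay:>2}s, monitor period {period}s: {outcome}")
```

Under this model, a host that needs 15 seconds to rejoin raises the alert with the default period of 10 seconds but not with a period of 30 seconds.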

Resolution

This behavior is an expected cosmetic issue. The alerts can be ignored and cleared once the HA configuration has finished resetting.

To avoid generating the alerts, you can increase the monitor period, although this is usually unnecessary.

Notes

  • Starting with vCenter Server 7.0 Update 1, the property name fdm.policy.unknownStateMonitorPeriod changed to fdm.unknownStateMonitorPeriod.
  • Prefixing these properties with das.config. applies them to all the hosts in the cluster.

To change the monitor period:
  1. In vCenter, right-click the cluster and select Edit Settings.

  2. Click vSphere HA and then Advanced Options.

  3. Add a new option (if not already present)

  • The default value is 10:

    vCenter version                 Option                                             Value
    8.0 Update 2 onwards            das.config.fdm.policy.unknownStateMonitorPeriod    10
    7.0 Update 1 to 8.0 Update 1    das.config.fdm.unknownStateMonitorPeriod           10
    Pre-7.0 Update 1                das.config.fdm.policy.unknownStateMonitorPeriod    10

  • For this issue, change the value from 10 to 30:

    vCenter version                 Option                                             Value
    8.0 Update 2 onwards            das.config.fdm.policy.unknownStateMonitorPeriod    30
    7.0 Update 1 to 8.0 Update 1    das.config.fdm.unknownStateMonitorPeriod           30
    Pre-7.0 Update 1                das.config.fdm.policy.unknownStateMonitorPeriod    30

  4. Disable and re-enable the HA settings of the cluster.
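
The version-dependent option names listed in the steps above can be captured in a small lookup, which may help when scripting the change across several clusters. The following is a minimal sketch and does not call any vCenter API: the function names are hypothetical, and representing a vCenter version as a `(major, minor, update)` tuple that ignores patch letters is an assumption.

```python
from typing import Dict, Tuple

# Option names as listed in this article.
POLICY_KEY = "das.config.fdm.policy.unknownStateMonitorPeriod"
SHORT_KEY = "das.config.fdm.unknownStateMonitorPeriod"

Version = Tuple[int, int, int]  # (major, minor, update), e.g. (7, 0, 3)


def monitor_period_option(version: Version) -> str:
    """Pick the advanced option name for a given vCenter version."""
    # 7.0 Update 1 through 8.0 Update 1 use the shorter name;
    # earlier and later releases use the "policy" name.
    if (7, 0, 1) <= version <= (8, 0, 1):
        return SHORT_KEY
    return POLICY_KEY


def advanced_option(version: Version, seconds: int = 30) -> Dict[str, str]:
    """Build the key/value pair to enter under vSphere HA Advanced Options."""
    return {"key": monitor_period_option(version), "value": str(seconds)}


if __name__ == "__main__":
    for v in [(6, 7, 3), (7, 0, 3), (8, 0, 1), (8, 0, 2)]:
        print(v, advanced_option(v))
```

The resulting key and value are what you would enter in step 3. Applying them still requires disabling and re-enabling HA on the cluster.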



Additional Information


Impact/Risks:
Increasing the monitor period also increases the time to start virtual machine failovers by the same amount (in this case, by 20 seconds) when a primary node stops during an actual HA failure.
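
The trade-off is simple arithmetic, shown here as a short Python sketch. The function name is hypothetical, and the model assumes the extra delay is exactly the difference between the new and the default monitor period, as the statement above describes.

```python
DEFAULT_MONITOR_PERIOD_S = 10  # default value given in this article


def added_failover_delay_s(new_period_s: int,
                           default_period_s: int = DEFAULT_MONITOR_PERIOD_S) -> int:
    """Extra seconds before real failovers start after raising the period."""
    if new_period_s < default_period_s:
        raise ValueError("this sketch only models increasing the monitor period")
    return new_period_s - default_period_s


if __name__ == "__main__":
    # Raising the period from 10 to 30 delays real failovers by 20 seconds.
    print(added_failover_delay_s(30))
```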