Troubleshooting NSX Edge High Availability (HA) issues (2126560)

Symptoms

  • High Availability (HA) service is not running
  • Failover to the secondary Edge fails after rebooting the primary
  • Split-brain scenario occurs between the Active and Standby HA Edges
  • VMware NSX Edge logs contain errors similar to:

    2015-04-13T15:22:11+00:00 vse-ahs-1 ha[]: [AHS]:  [daemon.warning] [1919]: WARN: Late heartbeat: Node vse-ahs-1: interval 30760 ms
    2015-04-13T15:22:11+00:00 vse-ahs-1 heartbeat:  [daemon.warning] [1919]: WARN: Late heartbeat: Node vse-ahs-1: interval 30760 ms
    2015-04-13T15:22:11+00:00 vse-ahs-1 ha[]: [AHS]:  [daemon.warning] [1919]: WARN: node vse-ahs-0: is dead
    2015-04-13T15:22:11+00:00 vse-ahs-1 heartbeat:  [daemon.warning] [1919]: WARN: node vse-ahs-0: is dead
    2015-04-13T15:22:11+00:00 vse-ahs-1 ha[]: [AHS]:  [daemon.info] [1919]: info: Link vse-ahs-0:vNic_1 dead.
    2015-04-13T15:22:11+00:00 vse-ahs-1 heartbeat:  [daemon.info] [1919]: info: Link vse-ahs-0:vNic_1 dead.
    2015-04-13T15:22:11+00:00 vse-ahs-1 ha[]: [AHS]:  [daemon.warning] [1919]: WARN: Deadtime value may be too small


    Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

    For more information, see Collecting diagnostic information for VMware NSX Edge (2079380).
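
When an Edge log contains many of these lines, it can help to reduce them to the worst late-heartbeat interval per node and the set of nodes declared dead. The following Python sketch is illustrative only. The regular expressions are derived from the sample lines above and may need adjusting for other releases, and the function name summarize_ha_warnings is hypothetical, not part of any NSX tooling.

```python
import re

# Patterns derived from the sample log lines in this article (an assumption:
# other NSX releases may format these messages differently).
LATE_RE = re.compile(r"WARN: Late heartbeat: Node (?P<node>\S+): interval (?P<ms>\d+) ms")
DEAD_RE = re.compile(r"WARN: node (?P<node>\S+): is dead")


def summarize_ha_warnings(lines):
    """Return (late, dead).

    late maps each node to its largest late-heartbeat interval in ms.
    dead is the set of nodes reported as dead.
    """
    late = {}
    dead = set()
    for line in lines:
        match = LATE_RE.search(line)
        if match:
            node, ms = match.group("node"), int(match.group("ms"))
            late[node] = max(ms, late.get(node, 0))
            continue
        match = DEAD_RE.search(line)
        if match:
            dead.add(match.group("node"))
    return late, dead


# Sample lines copied from the Symptoms section.
SAMPLE_LOG = """\
2015-04-13T15:22:11+00:00 vse-ahs-1 ha[]: [AHS]:  [daemon.warning] [1919]: WARN: Late heartbeat: Node vse-ahs-1: interval 30760 ms
2015-04-13T15:22:11+00:00 vse-ahs-1 heartbeat:  [daemon.warning] [1919]: WARN: node vse-ahs-0: is dead
2015-04-13T15:22:11+00:00 vse-ahs-1 ha[]: [AHS]:  [daemon.info] [1919]: info: Link vse-ahs-0:vNic_1 dead.
"""

if __name__ == "__main__":
    late, dead = summarize_ha_warnings(SAMPLE_LOG.splitlines())
    print("late heartbeats (ms):", late)
    print("nodes declared dead:", dead)
```

The ha[] and heartbeat facilities log the same event twice, so the sketch keeps a maximum per node and a set of dead nodes rather than counting lines.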

Purpose

This article provides information on understanding and troubleshooting High Availability (HA) on VMware NSX for vSphere 6.x Edge.

Overview

High Availability (HA) on NSX for vSphere 6.x Edge

High Availability (HA) ensures that the services provided by NSX Edge appliances are available even when a hardware or software failure renders a single appliance unavailable. In line with other HA features, such as vSphere HA or MSCS, Edge HA is not designed to deliver zero downtime, because the failover between appliances may require some services to be restarted.

NSX Edge HA is designed to minimize failover downtime. For example, it synchronizes the connection tracker of the stateful firewall and the stateful information held by the load balancer. However, the time required to bring all services back up is not zero. For example, dynamic routing incurs some downtime when an Edge is operating as a router.
 
In some cases, the two Edge HA appliances are unable to communicate with each other and unilaterally decide to become active. This behavior is expected and is designed to maintain availability of the active Edge services if the standby Edge is unavailable. Further, if the other appliance still exists, as soon as the communication is re-established, the two Edge HA appliances re-negotiate active and standby status. If this negotiation does not finish and if both appliances declare they are active when the connectivity is re-established, an unexpected behavior is observed. This condition, known as split brain, is observed due to these environmental conditions:
  • Physical network connectivity issues, including a network partition.
  • CPU or memory contention on the Edge.
  • Transient storage issues that would cause at least one Edge HA VM to become unavailable. For example, VMware has observed an improvement in Edge HA stability and performance when the VMs were moved off of overprovisioned storage. In particular, during large overnight backups, big spikes in storage latency can impact Edge HA stability.
  • Congestion on the physical or virtual network adapter involved with the exchange of packets.

In addition to environmental issues, a split-brain condition is observed when the HA configuration engine falls into a bad state or when the HA daemon fails.

Stateful High Availability

The primary NSX Edge appliance is in the active state and the secondary appliance is in the standby state. NSX Manager replicates the configuration of the primary appliance for the standby appliance, or you can add two appliances manually. VMware recommends that you create the primary and secondary appliances on separate resource pools and datastores. If you create the primary and secondary appliances on the same datastore, the datastore must be shared across all hosts in the cluster for the HA appliance pair to be deployed on different ESXi hosts. If the datastore is local storage, both virtual machines are deployed on the same host.

All NSX Edge services run on the active appliance. The primary appliance maintains a heartbeat with the standby appliance and sends service updates through an internal interface.

If a heartbeat is not received from the primary appliance within the specified time (default value is 15 seconds), the primary appliance is declared dead. The standby appliance moves to the active state, takes over the interface configuration of the primary appliance, and starts the NSX Edge services that were running on the primary appliance. When the switch over takes place, a system event displays in the System Events tab of Settings & Reports. Load Balancer and VPN services re-establish TCP connections with NSX Edge, so service is disrupted for a short while. Logical switch connections and firewall sessions are synchronized between the primary and standby appliances, so there is no service disruption during switch over.
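
The declare-dead rule described above can be pictured as a simple timeout comparison. The following Python sketch is a simplified model under stated assumptions, not the HA daemon's actual implementation. The name peer_declared_dead is hypothetical, and whether the real comparison is strict or inclusive at exactly the deadtime is not specified in this article.

```python
from datetime import datetime, timedelta

# Default Declare Dead Time from this article.
DEFAULT_DEADTIME = timedelta(seconds=15)


def peer_declared_dead(last_heartbeat, now, deadtime=DEFAULT_DEADTIME):
    """Return True when the peer has been silent for longer than deadtime.

    Assumption: a strict "greater than" comparison; the real daemon may differ.
    """
    return (now - last_heartbeat) > deadtime


if __name__ == "__main__":
    last_seen = datetime(2015, 4, 13, 15, 21, 40)
    # 10 seconds of silence: still inside the 15 second window.
    print(peer_declared_dead(last_seen, datetime(2015, 4, 13, 15, 21, 50)))
    # 31 seconds of silence: the peer would be declared dead.
    print(peer_declared_dead(last_seen, datetime(2015, 4, 13, 15, 22, 11)))
```

In this model, a shorter Declare Dead Time detects a failed peer sooner, but also makes a switch over more likely during a brief stall such as a storage latency spike.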

If the NSX Edge appliance fails and a bad state is reported, HA force syncs the failed appliance in order to revive it. When revived, it takes on the configuration of the now-active appliance and stays in a standby state. If the NSX Edge appliance is dead, delete the appliance and add a new one.

NSX Edge ensures that the two HA NSX Edge virtual machines are not on the same ESX host even after you use DRS and vMotion (unless you manually vMotion them to the same host). Two virtual machines are deployed on vCenter in the same resource pool and datastore as the appliance you configured. Local link IPs are assigned to HA virtual machines in the NSX Edge HA so that they can communicate with each other. You can specify management IP addresses to override the local links.

For more information, see the About High Availability section of the NSX Administration Guide. Also, see the Stateful Active/Standby HA Model section of the VMware NSX for vSphere Network Virtualization Design Guide.

High Availability Enhancements in NSX for vSphere 6.2.4

The robustness of High Availability in this release has been improved with these changes:

NSX Manager:

  • Reserve CPU/Memory resources for the NSX Edge.
  • Support for dedicated High Availability interface.

NSX Edge:

  • Bidirectional Forwarding Detection (BFD) for heartbeat message exchange/cluster membership.
  • New state machine for HA role election.

Resolution

Validate that each troubleshooting step is true for your environment. Each step provides instructions or a link to a document to eliminate possible causes and take corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Do not skip a step.
  1. Check the release notes for current releases to see if the issue is resolved in a bug fix. For more information, see the VMware NSX for vSphere Documentation page.
  2. Verify that at least one internal interface is configured when deploying the Edge as this is required before enabling HA.
  3. Verify that the health status of both Active and Standby Edge is showing normal by running the show service highavailability command.

    For example:

    EDGE-1-0> show service highavailability

    Highavailability Status: running
    Highavailability Unit Name: edge-1-0
    Highavailability Unit State: active
    Highavailability Interface(s): vNic_1
    Unit Poll Policy:
       Frequency: 3 seconds
       Deadtime: 15 seconds
       Stateful Sync-up Time: 10 seconds
    Highavailability Healthcheck Status:
       Peer host [edge-1-1 ]: good
       This host [edge-1-0 ]: good
    Highavailability Stateful Logical Status:
       File-Sync running
       Connection-Sync running
          xmit xerr rcv rerr
          21612 0 13920 0

         
  4. Find the peer Edge HA IP address with the show service highavailability command.
  5. Verify network connectivity by pinging the peer HA IP address with the ping <peer_ha_ip> command. If the peer is unreachable, resolve the network connectivity issue first. If it is reachable, an intermittent network connectivity issue may have occurred earlier. For more information, see Split-brain scenario on NSX/vShield Edge configured for High Availability (HA) (2117922).

    Note: Consider decreasing the Declare Dead Time settings from the default value of 15 seconds. If a heartbeat is not received from the active Edge within the specified time, the active edge is declared dead. The standby edge then moves to the active state, takes over the interface configuration of the primary appliance, and starts the NSX Edge services that were running on the primary appliance. When the switch over takes place, a system event is displayed in the System Events tab of Settings & Reports. For more information, see Virtual machines lose network connectivity when redeploying a High Availability (HA) enabled VMware vCloud Networking and Security or NSX for vSphere Edge (2135309).

    To configure heartbeat settings:

    1. Log in to the vSphere Web Client.
    2. Click Networking & Security and then click NSX Edges.
    3. Double-click an NSX Edge.
    4. Click the Manage tab and then click the Settings tab.
    5. In the HA Configuration panel, click Change.
    6. In the Change HA Configuration dialog box, enter the Declare Dead Time. The default is 15 seconds.
    7. Click OK.

  6. Collect CPU, memory, network, and/or storage utilization details for both the NSX Edge VMs and ESXi hosts.
  7. Verify if heartbeats are being sent/received by running the debug packet display interface <ha-interface> udp_port_694 command.
  8. Verify the HA daemon internal status by running the show service highavailability internal command.
    Note: This command is supported in VMware NSX for vSphere 6.2.x. For versions prior to 6.2, run the crm_mon -1 command from the root shell.
  9. Collect the NSX Edge Tech Support Logs. For more information, see Collecting diagnostic information for VMware NSX Edge (2079380).

    Note: The preceding commands should be captured on the console of both NSX Edge VMs.
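
When the output of show service highavailability is captured from both Edge VMs, a small script can compare the two captures. The following Python sketch is illustrative only. It assumes the output layout shown in step 3, which may differ between releases, and the function names parse_ha_status and ha_problems are hypothetical.

```python
import re

# Sample output copied from step 3 of this article.
SAMPLE_OUTPUT = """\
Highavailability Status: running
Highavailability Unit Name: edge-1-0
Highavailability Unit State: active
Highavailability Interface(s): vNic_1
Unit Poll Policy:
   Frequency: 3 seconds
   Deadtime: 15 seconds
   Stateful Sync-up Time: 10 seconds
Highavailability Healthcheck Status:
   Peer host [edge-1-1 ]: good
   This host [edge-1-0 ]: good
"""

HOST_RE = re.compile(r"(?:Peer|This) host \[\s*(\S+?)\s*\]:\s*(\S+)")
FIELD_RE = re.compile(
    r"(Highavailability (?:Status|Unit Name|Unit State)|Frequency|Deadtime):\s*(.+)"
)


def parse_ha_status(text):
    """Parse selected fields of the HA status output into a dictionary."""
    status = {"healthcheck": {}}
    for raw in text.splitlines():
        line = raw.strip()
        match = HOST_RE.match(line)
        if match:
            status["healthcheck"][match.group(1)] = match.group(2)
            continue
        match = FIELD_RE.match(line)
        if match:
            status[match.group(1)] = match.group(2).strip()
    return status


def ha_problems(status):
    """Return a list of human-readable findings; an empty list means no findings."""
    problems = []
    if status.get("Highavailability Status") != "running":
        problems.append("HA service is not running")
    for host, health in status["healthcheck"].items():
        if health != "good":
            problems.append(f"{host} health is {health}")
    return problems


if __name__ == "__main__":
    parsed = parse_ha_status(SAMPLE_OUTPUT)
    print(parsed["Highavailability Unit State"], ha_problems(parsed))
```

If both captures report a Unit State of active, that is consistent with the split-brain condition described in the Overview.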

If your issue still exists after trying the steps in this article, see:

Additional Information

Using Log Messages to Troubleshoot NSX Edge HA issues

This example log sequence illustrates the automated recovery procedure from a temporary split brain condition. The HA event begins at 2016-01-16T12:21 and finishes within four seconds.

2016-01-16T12:21:44+00:00 vShield-edge-3-1 ha[]: [default]:  [daemon.warning] [1730]: WARN: node vshield-edge-3-0: is dead
2016-01-16T12:21:44+00:00 vShield-edge-3-1 heartbeat:  [daemon.warning] [1730]: WARN: node vshield-edge-3-0: is dead
2016-01-16T12:21:48+00:00 vShield-edge-3-1 crmd:  [daemon.info] [1953]: info: crm_update_peer_proc: vshield-edge-3-0.ais is now online, 


In this example log sequence, Edge2-0 loses heartbeats with Edge2-1. However, the condition recovers automatically within two seconds.

2016-01-16T12:50:53+00:00 vShield-edge-2-0 ha[]: [default]:  [daemon.warning] [1781]: WARN: node vshield-edge-2-1: is dead
2016-01-16T12:50:53+00:00 vShield-edge-2-0 crmd:  [daemon.info] [1930]: info: crm_update_peer_proc: vshield-edge-2-1.ais is now offline
2016-01-16T12:50:55+00:00 vShield-edge-2-0 ha[]: [default]:  [daemon.crit] [1781]: CRIT: Cluster node vshield-edge-2-1 returning after partition.
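
The recovery times quoted for the two sequences above can be derived from the log timestamps. The following Python sketch is illustrative only. It assumes the ISO 8601 timestamp prefix shown in these excerpts and treats "is now online" and "returning after partition" as recovery markers, which is an assumption based on these two examples. The function names are hypothetical.

```python
import re
from datetime import datetime

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})")

# Assumption: these substrings indicate that the peer is reachable again.
RECOVERY_MARKERS = ("is now online", "returning after partition")


def parse_ts(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    match = TS_RE.match(line.strip())
    return datetime.fromisoformat(match.group(1)) if match else None


def recovery_seconds(lines):
    """Seconds from the first 'is dead' warning to the first later recovery message."""
    dead_at = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if dead_at is None and "is dead" in line:
            dead_at = ts
        elif dead_at is not None and any(m in line for m in RECOVERY_MARKERS):
            return (ts - dead_at).total_seconds()
    return None


# The two sequences from this article.
SEQ_EDGE_3 = [
    "2016-01-16T12:21:44+00:00 vShield-edge-3-1 ha[]: [default]:  [daemon.warning] [1730]: WARN: node vshield-edge-3-0: is dead",
    "2016-01-16T12:21:44+00:00 vShield-edge-3-1 heartbeat:  [daemon.warning] [1730]: WARN: node vshield-edge-3-0: is dead",
    "2016-01-16T12:21:48+00:00 vShield-edge-3-1 crmd:  [daemon.info] [1953]: info: crm_update_peer_proc: vshield-edge-3-0.ais is now online, ",
]
SEQ_EDGE_2 = [
    "2016-01-16T12:50:53+00:00 vShield-edge-2-0 ha[]: [default]:  [daemon.warning] [1781]: WARN: node vshield-edge-2-1: is dead",
    "2016-01-16T12:50:53+00:00 vShield-edge-2-0 crmd:  [daemon.info] [1930]: info: crm_update_peer_proc: vshield-edge-2-1.ais is now offline",
    "2016-01-16T12:50:55+00:00 vShield-edge-2-0 ha[]: [default]:  [daemon.crit] [1781]: CRIT: Cluster node vshield-edge-2-1 returning after partition.",
]

if __name__ == "__main__":
    print(recovery_seconds(SEQ_EDGE_3), recovery_seconds(SEQ_EDGE_2))
```

A value of None means that no recovery message followed the "is dead" warning in the supplied lines, which may warrant a closer look at the peer Edge.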


The logging is provided in INFO mode, with details in DEBUG mode.
  • Split Brain Detection and Confirmation

    LOGGER.debug("Attempting Split Brain recovery for edge {}, config version {}.", edgeId, configVersion);

    LOGGER.debug("Confirmed Split Brain for edge {} with vm0 state as {} and vm1 state as {}", edgeId, haState0, haState1);
    LOGGER.info("Split Brain confirmed for edge {}", edgeId); 
    OR
    LOGGER.debug("No Split Brain for edge {} with vm0 state as {} and vm1 state as {}", edgeId, haState0, haState1);
    LOGGER.info("No Split Brain observed for edge {}. Returning from HaAutoHealTask without further action.", edgeId);


  • Split Brain Recovery Attempt

    EdgeEventCodes.HA_STATE_SPLIT_BRAIN,
    logMessage = "System event: Attempt Split Brain recovery on edge id " + edgeId + ". AutoHeal Counter :" + ct.getAutoHealCount();

    LOGGER.info("Configuring deadTimeInterval to {} seconds to restart HA daemon and recover from Split Brain for edge {}", deadTimeInterval, edge);


    EdgeEventCodes.HA_STATE_SPLIT_BRAIN_RECOVERY_ATTEMPT,
    String logMessage = "Recovery from Split Brain for vShield Edge " + edgeId + " attempted with count "+ attempt;


  • Split Brain Recovery not Attempted

    LOGGER.info("Current edge version {} is not APPLIED. Not attempting HA Split Brain recovery.", edge.getEdgeVersion());

  • Split Brain Recovered

    EdgeEventCodes.HA_STATE_SPLIT_BRAIN_RECOVERED,
    String logMessage = "System event: Split Brain recovered on edge id " + edgeId + ". AutoHeal Counter :" + ct.getAutoHealCount();


  • Split Brain after recovery (with N recovery attempts, following which Events are raised in back off mode)

    EdgeEventCodes.HA_STATE_SPLIT_BRAIN,
    logMessage = "Raising system event for Split Brain on edge id " + edgeId + ". Event Counter :" + ct.getEventCount();


  • Reset Time Interval to re-attempt Split Brain Recovery

    LOGGER.info("Split Brain Recovery time interval elapsed. Reset the counter for edge {}.", edgeId);
  • Message you should watch for that may lead to failover

    WARN: Late heartbeat: - heartbeat is not returned in the heartbeat interval
  • Message you should watch for when Heartbeats are lost completely

    WARN: Deadtime value may be too small

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
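
To see which of these messages occur in a log bundle, the message fragments listed above can be counted. The following Python sketch is illustrative only. The marker strings are taken from the templates in this section, the category labels are hypothetical, and the rendered messages in a given release may differ from the templates.

```python
from collections import Counter

# (fragment, label) pairs. Fragments come from the message templates in this
# article; labels are illustrative names, not NSX event codes.
MARKERS = [
    ("No Split Brain observed", "no_split_brain"),
    ("Split Brain confirmed", "split_brain_confirmed"),
    ("Attempt Split Brain recovery", "recovery_attempted"),
    ("Not attempting HA Split Brain recovery", "recovery_not_attempted"),
    ("Split Brain recovered", "split_brain_recovered"),
    ("Raising system event for Split Brain", "split_brain_persists"),
    ("Late heartbeat", "late_heartbeat"),
    ("Deadtime value may be too small", "heartbeats_lost"),
]


def classify(line):
    """Return the label of the first matching fragment, or None."""
    for fragment, label in MARKERS:
        if fragment in line:
            return label
    return None


def count_events(lines):
    """Count labeled events across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the Symptoms section of this article.
    demo = [
        "2015-04-13T15:22:11+00:00 vse-ahs-1 ha[]: [AHS]:  [daemon.warning] [1919]: WARN: Late heartbeat: Node vse-ahs-1: interval 30760 ms",
        "2015-04-13T15:22:11+00:00 vse-ahs-1 ha[]: [AHS]:  [daemon.warning] [1919]: WARN: Deadtime value may be too small",
    ]
    print(dict(count_events(demo)))
```

Repeated late_heartbeat entries without heartbeats_lost suggest delayed rather than lost heartbeats, following the distinction made in the last two bullets above.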
