
Troubleshooting NSX for vSphere 6.x Controllers (2125767)


Symptoms

  • Deployment of NSX Controller(s) fails.
  • NSX Controller fails to join the cluster.
  • Running the show control-cluster status command shows the Majority status flapping between Connected to cluster majority and Interrupted connection to cluster majority.

Purpose

This article provides information on identifying the cause of NSX Controller failures and on troubleshooting VMware NSX for vSphere 6.x Controllers.

Resolution

IMPORTANT: Please note that this knowledge base article is no longer being updated. For the most up-to-date information, see the latest version of the NSX Troubleshooting Guide.

Validate that each troubleshooting step is true for your environment. Each step provides instructions or a link to a document to eliminate possible causes and take corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution.

Installation and deployment issues:

  • Verify that at least three Controller nodes are deployed in the cluster. VMware recommends using the native vSphere anti-affinity rules to avoid deploying more than one Controller node on the same ESXi host.
  • Verify that all NSX Controllers display a Connected status. If any Controller node displays a Disconnected status, run the show control-cluster status command on all Controller nodes and ensure that the following values are consistent (a scripted check across all nodes is sketched after this list):

    Type               Status
    Join status        Join complete
    Majority status    Connected to cluster majority
    Cluster ID         Same information on all Controller nodes

  • In addition, ensure that all Roles are consistent on all Controller nodes:

    Role                 Configured status    Active status
    api_provider         enabled              activated
    persistence_server   enabled              activated
    switch_manager       enabled              activated
    logical_manager      enabled              activated
    directory_server     enabled              activated


  • Verify that the vnet-controller process is running. Run the show process command on all Controller nodes and ensure that the java-dir-server service is running.
  • Verify the system status and resource utilization for each Controller. Run the show status command and ensure that the load is optimal on all nodes.
  • Verify the cluster history and ensure there are no signs of host connection flapping, VNI join failures, or abnormal cluster membership changes. To verify this, run the show control-cluster history command.
  • Verify that VXLAN Network Identifier (VNI) is configured. For more information, see the VXLAN Preparation Steps section of the VMware VXLAN Deployment Guide.
  • Verify that SSL is enabled on the Controller cluster. Run the show log cloudnet/cloudnet_java-vnet-controller*.log filtered-by sslEnabled command on each of the Controller nodes.
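
To compare these values side by side, the following is a minimal Python sketch (using paramiko) that collects the show control-cluster status output from each Controller node over SSH. The node IP addresses and credentials are placeholders for your environment, and a restricted appliance shell may require an interactive channel instead of exec_command.

# Hedged sketch: gather "show control-cluster status" from every Controller node so
# the Join status, Majority status, and Cluster ID can be compared in one place.
import paramiko

CONTROLLER_NODES = ["192.168.110.31", "192.168.110.32", "192.168.110.33"]  # placeholder IPs
USERNAME = "admin"                       # default Controller CLI account
PASSWORD = "controller-cli-password"     # placeholder

def cluster_status(node_ip):
    """Return the raw 'show control-cluster status' output from one node."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(node_ip, username=USERNAME, password=PASSWORD, timeout=15)
    try:
        # Some appliance shells only accept commands on an interactive channel;
        # if exec_command returns nothing, switch to client.invoke_shell().
        _, stdout, _ = client.exec_command("show control-cluster status")
        return stdout.read().decode()
    finally:
        client.close()

if __name__ == "__main__":
    for ip in CONTROLLER_NODES:
        print("--- {} ---".format(ip))
        print(cluster_status(ip))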

Host connectivity issues:

  • Check for host connectivity errors.

    Run the show log cloudnet/cloudnet_java-vnet-controller*.log filtered-by host_IP command on each of the Controller nodes.
  • Check for any abnormal error statistics.

    Run these commands on each of the Controller nodes:

    •     show control-cluster core stats: overall stats
    •     show control-cluster core stats-sample: latest stats samples
    •     show control-cluster core connection-stats ip: per connection stats
       
  • Verify the logical switch and logical router message statistics and check for abnormally high message rates.

    Run these commands on each of the Controller nodes:

    •     show control-cluster logical-switches stats
    •     show control-cluster logical-routers stats
    •     show control-cluster logical-switches stats-sample
    •     show control-cluster logical-routers stats-sample
    •     show control-cluster logical-switches vni-stats vni
    •     show control-cluster logical-switches vni-stats-sample vni
    •     show control-cluster logical-switches connection-stats ip
    •     show control-cluster logical-routers connection-stats ip

    For more information, see the NSX Command Line Interface Reference Guide.

  • NSX for vSphere release 6.2.4 extends the Central CLI for NSX troubleshooting with the show host hostID health-status command to check the health status of hosts in your prepared clusters. For Controller troubleshooting, these health checks are supported:
    • Check whether the net-config-by-vsm.xml file is synchronized with the Controller list.
    • Check whether there is a socket connection to the Controller.
    • Check whether the VNI is created and whether its configuration is correct.
    • Check whether the VNI connects to the master Controllers (if the control plane is enabled).
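
If you prefer to drive these checks programmatically rather than from an SSH session, the Central CLI can be invoked through the NSX Manager API. The following is a minimal Python sketch that posts the health-status command to the central CLI endpoint (/api/1.0/nsx/cli?action=execute, as documented in the NSX API guide); the manager address, credentials, and host ID are placeholders, and you should confirm the endpoint against your NSX release.

# Hedged sketch: run "show host <hostID> health-status" through the NSX Manager
# central CLI API wrapper. All addresses, credentials, and IDs are placeholders.
import requests

NSX_MANAGER = "https://nsxmgr.example.local"   # placeholder
AUTH = ("admin", "nsx-manager-password")        # placeholder
HOST_ID = "host-21"                             # vCenter managed object ID of the host

body = "<nsxcli><command>show host {} health-status</command></nsxcli>".format(HOST_ID)
resp = requests.post(
    NSX_MANAGER + "/api/1.0/nsx/cli?action=execute",
    data=body,
    headers={"Content-Type": "application/xml", "Accept": "text/plain"},
    auth=AUTH,
    verify=False,  # typical lab setup with a self-signed certificate
)
resp.raise_for_status()
print(resp.text)   # plain-text health check results for the host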

Controller clustering issues:

The show control-cluster status command is the recommended command to view whether a Controller has joined a control cluster. 

The show control-cluster startup-nodes command was not designed to display all nodes currently in the cluster. Instead, it shows which other Controller nodes are used by this node to bootstrap membership into the cluster when the Controller process restarts. Accordingly, the command output may show some nodes which are shut down or have otherwise been pruned from the cluster.

Starting with NSX for vSphere 6.1.5 and 6.2.1, the NSX Manager user interface displays the connectivity status between NSX Controllers in the Controller cluster. This display allows you to view connectivity differences between nodes. For example, in a cluster with three Controllers A, B, and C, the control status allows you to detect a partial cluster partition in which A is connected to B and C, but B and C cannot see each other.

In addition, the show control-cluster network ipsec status command allows users to inspect the Internet Protocol Security (IPsec) state. Provide the output of this command and the Controller logs when reporting a suspected control cluster issue to your VMware technical support representative.  

For resolved issues with NSX control clusters, see NSX Controller disconnected or isolates intermittently (2127655).

Controller deletion and recovery issues:

In the case of a single NSX Controller failure, you still have two working Controllers; cluster majority is maintained and the control plane continues to function. Try rebooting the failed Controller first. If it fails to rejoin the cluster, delete all three Controllers, add new ones, and then perform the Update Controller State operation to restore a fully functional three-node cluster.

In the case of multiple Controller failures, delete all three Controllers, add new ones, and then perform the Update Controller State operation to restore a fully functional three-node cluster. For more information, see the Recover from an NSX Controller Failure section in the NSX Administration Guide.

In NSX for vSphere releases earlier than 6.2.4, the only supported Controller virtual machine removal procedure is forceful removal:
  1. The node is powered off.
  2. The Controller VM is removed.
  3. The Controller VM identity is removed from the cluster.
  4. The IP address of Controller node is released.
NSX for vSphere 6.2.4 refines the Controller removal procedure to enhance operational manageability and introduces a graceful removal procedure. This procedure checks for these conditions before removing the node:
  • There is no ongoing Controller node upgrade operation.
  • There is no inactive Controller node when the operation is issued.
  • The Controller cluster is healthy, and a Controller cluster API request can be processed.
  • The host state, as obtained from the vCenter Server inventory, shows connected and powered on.
  • This is not the last Controller node.

Graceful Controller removal uses this sequence:

  1. Power off the node.
  2. Check cluster health.
  3. If the cluster is not healthy, power on the Controller and fail the removal request. 
  4. If the cluster is healthy, remove the Controller VM and release the IP address of the node.
  5. Remove the Controller VM's identity from the cluster. 

If a Controller deletion succeeds only partially and an entry is left behind in the NSX Manager database in a cross-vCenter environment, use the DELETE https://vsm-ip/api/2.0/vdn/controller/external API. If the Controller was imported through the NSX Manager API, use the removeExternalControllerReference API with the forceRemoval option.  
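
As a minimal sketch, the cleanup call above can be scripted as follows; the manager address and credentials are placeholders, and the exact form of the forceRemoval option is not shown here, so consult the NSX API guide if you need it.

# Hedged sketch: remove a stale external Controller entry left behind in the NSX
# Manager database, using the DELETE endpoint named in this article.
import requests

NSX_MANAGER = "https://nsxmgr.example.local"   # placeholder
AUTH = ("admin", "nsx-manager-password")        # placeholder

resp = requests.delete(
    NSX_MANAGER + "/api/2.0/vdn/controller/external",
    auth=AUTH,
    verify=False,  # lab setup with a self-signed certificate
)
print(resp.status_code, resp.text)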

Deleting the Controller VM or powering it off directly in vCenter Server is not a supported operation. NSX for vSphere 6.2.4 reports a new Out of Sync status in the NSX user interface. Clicking the Out of Sync status opens a window with a Resolve button.

netcpa (control plane) issues:

On NSX for vSphere, netcpa works as a local agent daemon, communicating with NSX Manager and with the Controller cluster. For resolved issues related to netcpa, see:
The NSX for vSphere 6.2.0 release introduced a Communication Channel Health feature, which checks the status of communication between NSX Manager and the firewall agent, between NSX Manager and the control plane agent, and between hosts and Controllers. For more information, see the Check Communication Channel Health section in the NSX for vSphere Administration Guide.

NSX for vSphere 6.2.4 enhances the Communication Channel Health feature as follows:

  • Provides error details during communication faults.
  • Generates an event when a channel goes into a wrong status.
  • Heartbeat messages are now generated from NSX Manager to hosts.
For example, to retrieve the channel status and the error details reported when a channel goes into a wrong state:
 
GET https://<vsm_host_ip>/api/2.0/vdn/inventory/host/{hostId}/connection/status
 
Return:

<?xml version="1.0" encoding="UTF-8"?>
<hostConnStatus>
   <hostName>10.161.246.20</hostName>
   <hostId>host-21</hostId>
   <nsxMgrToFirewallAgentConn>UP</nsxMgrToFirewallAgentConn>
   <nsxMgrToControlPlaneAgentConn>UP</nsxMgrToControlPlaneAgentConn>
   <hostToControllerConn>DOWN</hostToControllerConn>
   <fullSyncCount>-1</fullSyncCount>
   <hostToControllerConnectionErrors>
      <hostToControllerConnectionError>
         <controllerIp>10.160.203.236</controllerIp>
         <errorCode>1255604</errorCode>
         <errorMessage>Connection Refused</errorMessage>
      </hostToControllerConnectionError>
      <hostToControllerConnectionError>
         <controllerIp>10.160.203.237</controllerIp>
         <errorCode>1255603</errorCode>
         <errorMessage>SSL Handshake Failure</errorMessage>
      </hostToControllerConnectionError>
   </hostToControllerConnectionErrors>
</hostConnStatus>
 
In addition to the overall channel status of UP, DOWN, or NOT AVAILABLE, NSX for vSphere 6.2.4 provides more granular event codes and error details during a communication fault.
 
The following error codes are supported:
  • 1255601: Incomplete Host Certificate
  • 1255602: Incomplete Controller Certificate
  • 1255603: SSL Handshake Failure
  • 1255604: Connection Refused
  • 1255605: Keep-alive Timeout
  • 1255606: SSL Exception
  • 1255607: Bad Message
  • 1255620: Unknown Error
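
The connection status API and error code table above lend themselves to a quick scripted check. The following is a minimal Python sketch that queries the per-host status and translates any per-Controller error codes; the manager address, credentials, and host ID are placeholders.

# Hedged sketch: report the three channel states and decode any per-Controller
# connection errors returned by the connection/status API shown above.
import requests
import xml.etree.ElementTree as ET

NSX_MANAGER = "https://nsxmgr.example.local"   # placeholder
AUTH = ("admin", "nsx-manager-password")        # placeholder
HOST_ID = "host-21"                             # placeholder

ERROR_CODES = {
    "1255601": "Incomplete Host Certificate",
    "1255602": "Incomplete Controller Certificate",
    "1255603": "SSL Handshake Failure",
    "1255604": "Connection Refused",
    "1255605": "Keep-alive Timeout",
    "1255606": "SSL Exception",
    "1255607": "Bad Message",
    "1255620": "Unknown Error",
}

url = "{}/api/2.0/vdn/inventory/host/{}/connection/status".format(NSX_MANAGER, HOST_ID)
resp = requests.get(url, auth=AUTH, verify=False)  # lab setup with a self-signed certificate
resp.raise_for_status()

root = ET.fromstring(resp.content)
print("NSX Manager -> firewall agent:      ", root.findtext("nsxMgrToFirewallAgentConn"))
print("NSX Manager -> control plane agent: ", root.findtext("nsxMgrToControlPlaneAgentConn"))
print("Host -> Controller:                 ", root.findtext("hostToControllerConn"))

for err in root.iter("hostToControllerConnectionError"):
    code = err.findtext("errorCode")
    print("  {}: {} (code {} = {})".format(
        err.findtext("controllerIp"),
        err.findtext("errorMessage"),
        code,
        ERROR_CODES.get(code, "not listed"),
    ))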

Storage latency issues:

Verify that your environment is not experiencing high storage latencies. ZooKeeper logs these messages when storage latencies are greater than one second (see the Symptoms section); to view them, run the show log cloudnet/cloudnet_java-zookeeper.<full file name>.log filtered-by fsync command on each Controller node. VMware recommends dedicating a LUN specifically for the control cluster and/or moving the storage array closer to the control cluster in terms of latency. For more information, see:
 
NSX for vSphere 6.2.4 enhances operational manageability by adding monitoring and logging of disk I/O latency. The read and write latency measurements are fed into a moving average (5 seconds by default), which is used to trigger an alert when the latency limit is breached. The alert is turned off after the average drops back below the low watermark. By default, the high watermark is set to 200 ms and the low watermark to 100 ms.

nsx-controller # show disk-latency-alert config
enabled=True   low-wm=51      high-wm=150
nsx-controller # set disk-latency-alert enabled yes
nsx-controller # set disk-latency-alert low-wm 100
nsx-controller # set disk-latency-alert high-wm 200

A REST API is also available to fetch the latency alert status of the Controller nodes.
 
GET https://<VSM-IP>/api/2.0/vdn/controller/<controller-id>/systemStats

Response:

<?xml version="1.0" encoding="UTF-8"?>
<controllerNodeStatus>
   <id>controller-2</id>
   <ipAddress>10.33.72.202</ipAddress>
   <syncTime>1455066817913</syncTime>
   <cpuCoreCount>4</cpuCoreCount>
   <cpuAverageLoad>
      <double>0.14</double>
      <double>0.17</double>
      <double>0.15</double>
   </cpuAverageLoad>
   <totalMemory>3926696</totalMemory>
   <usedMemory>2036432</usedMemory>
   <cachedMemory>1038076</cachedMemory>
   <totalSwap>4190204</totalSwap>
   <usedSwap>0</usedSwap>
   <systemTime>1455045226490</systemTime>
   <upTime>9132230</upTime>
   <nodeFailoverReady>false</nodeFailoverReady>
   <nodeDiskLatencyStatus>
      <deviceName>sda</deviceName>
      <refreshTime>1452365702000</refreshTime>
      <latencyType>w_await</latencyType>
      <lastLatency>0.0</lastLatency>
      <avgLatency>0.0</avgLatency>
      <alertEnabled>false</alertEnabled>
   </nodeDiskLatencyStatus>
   [output omitted]
 
The response of the existing GET controller API contains a flag that indicates whether a disk latency alert has been detected on a Controller node.

GET https://<VSM-IP>/api/2.0/vdn/controller
Response:
<?xml version="1.0" encoding="UTF-8"?>
<controllers>
   <controller>
       <objectTypeName>Controller</objectTypeName>
       <revision>0</revision>
       <name>Controller-1</name>
       …
       <diskLatencyAlertDetected>true</diskLatencyAlertDetected>
   </controller>
</controllers>
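
As a minimal sketch, the two APIs above can be combined to scan every Controller for a disk latency alert and, for any flagged node, pull its per-node disk latency statistics. The manager address and credentials are placeholders, and the sketch assumes the Controller <id> element appears in the portion of the list response elided ("…") above.

# Hedged sketch: flag Controllers reporting diskLatencyAlertDetected=true and show
# their nodeDiskLatencyStatus details from the systemStats API.
import requests
import xml.etree.ElementTree as ET

NSX_MANAGER = "https://nsxmgr.example.local"   # placeholder
AUTH = ("admin", "nsx-manager-password")        # placeholder

def get_xml(path):
    resp = requests.get(NSX_MANAGER + path, auth=AUTH, verify=False)  # self-signed cert in a lab
    resp.raise_for_status()
    return ET.fromstring(resp.content)

controllers = get_xml("/api/2.0/vdn/controller")
for ctrl in controllers.iter("controller"):
    ctrl_id = ctrl.findtext("id")   # assumed to be in the elided part of the response above
    alert = ctrl.findtext("diskLatencyAlertDetected")
    print("{}: diskLatencyAlertDetected={}".format(ctrl_id, alert))
    if alert == "true" and ctrl_id:
        stats = get_xml("/api/2.0/vdn/controller/{}/systemStats".format(ctrl_id))
        disk = stats.find("nodeDiskLatencyStatus")
        if disk is not None:
            print("  device={} latencyType={} avgLatency={}".format(
                disk.findtext("deviceName"),
                disk.findtext("latencyType"),
                disk.findtext("avgLatency"),
            ))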

NSX Controller Logs:

To collect NSX Controller logs:

  1. Log in to vCenter Server using the vSphere Web Client.
  2. Click Networking and Security.
  3. Click Installation in the left pane.
  4. Click the Manage tab and select the Controller from which you want to download logs.
  5. Click Download Tech support logs.
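
The same tech support bundle can typically also be retrieved over the NSX Manager API. The following is a minimal Python sketch that downloads it using the Controller techsupportlogs path from the NSX for vSphere API guide; confirm that the path exists in your release, and note that the manager address, credentials, and Controller ID are placeholders.

# Hedged sketch: stream a Controller tech support bundle to a local file. The
# techsupportlogs path is assumed from the NSX for vSphere API guide; verify it
# for your release before relying on this.
import requests

NSX_MANAGER = "https://nsxmgr.example.local"   # placeholder
AUTH = ("admin", "nsx-manager-password")        # placeholder
CONTROLLER_ID = "controller-2"                  # placeholder

url = "{}/api/2.0/vdn/controller/{}/techsupportlogs".format(NSX_MANAGER, CONTROLLER_ID)
with requests.get(url, auth=AUTH, verify=False, stream=True) as resp:  # self-signed cert in a lab
    resp.raise_for_status()
    with open("{}-techsupport.log.gz".format(CONTROLLER_ID), "wb") as out:
        for chunk in resp.iter_content(chunk_size=65536):
            out.write(chunk)
print("Saved tech support bundle for", CONTROLLER_ID)
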
Notes:
  • New installations of NSX 6.2.4 deploy NSX Controller appliances with updated disk partitions to provide extra cluster resiliency. In previous releases, log overflow on the Controller disk could impact Controller stability. In addition to log management enhancements that prevent overflows, the NSX Controller appliance now has separate disk partitions for data and logs to safeguard against these events. If you upgrade to NSX 6.2.4, the NSX Controller appliances retain their original disk layout.
  • If your problem still exists after trying the steps in this article, file a support request with VMware Support and note this Knowledge Base Article ID (2125767) in the problem description. For more information, see How to Submit a Support Request.

Tags

controller fails, join cluster
