In VMware NSX for vSphere 6.x, the NSX Edge experiences high CPU utilization and/or fails to accept configuration changes

Article ID: 345647


Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • NSX Edge experiences high CPU utilization
  • NSX Edge fails to accept configuration changes
  • Running the show log command on the NSX Manager console reports entries similar to:

    2015-10-15 09:31:32.473 UTC ERROR TaskFrameworkExecutor-1 PublishUtils:92 - Timeout happened during execution of jobId 'jobdata-173581' for edgeId 'edge-18', startTime '1444899795922' currentTime '1444901492473': doingRollback 'false'
    2015-10-15 09:31:32.473 UTC ERROR TaskFrameworkExecutor-1 PublishTask:346 - Failed jobId 'jobdata-173581' for edge 'edge-18' during publishing.
    com.vmware.vshield.edge.exception.VshieldEdgeException:
    vShield Edge:10163:Publish Job jobdata-173581 for NSX Edge edge-18 timed out. It has already taken 28 minutes, hence was aborted and rollback has been performed.


  • Running the show log command on the NSX Manager console also reports entries similar to:

    2015-10-15 09:21:32.453 UTC INFO TaskFrameworkExecutor-1 AbstractEdgeApplianceManager:643 - The vse command is being sent to 'vm-31231' over msgBus
    2015-10-15 09:31:32.453 UTC INFO messagingTaskExecutor-10 QueueSubscriptionManager:252 - Purging queue 'vse_5031887e-aa71-ede0-7078-e9e1c2bf6b94_request_queue'. No wait = 'true'.
    2015-10-15 09:31:32.457 UTC INFO messagingTaskExecutor-10 VirtualMachineVcOperationsImpl:54 - Retrieving power-state for VM 'PRNESG003368-1'
    2015-10-15 09:31:32.462 UTC INFO messagingTaskExecutor-10 VirtualMachineVcOperationsImpl:57 - Power-state for VM 'PRNESG003368-1' = 'poweredOn'
    2015-10-15 09:31:32.462 UTC INFO messagingTaskExecutor-10 EdgeUtils:302 - SysEvent-Detailed-Message :(Kept only in logs) :: Rpc request to vm: vm-31231 timed out
    2015-10-15 09:31:32.466 UTC INFO messagingTaskExecutor-10 SystemEventDaoImpl:128 - [SystemEvent] Time:'Thu Oct 15 09:31:32.462 UTC 2015', Severity:'Major', Event Source:'vm-31231', Code:'30014', Event Message:'Failed to communicate with the NSX Edge VM.', Module:'NSX Edge Communication Agent'


    For more information, see Collecting diagnostic information for VMware NSX for vSphere 6.x (2074678).

    Note: The preceding log excerpts are only examples. Date, time, and environment-specific values vary from environment to environment.


Environment

VMware NSX for vSphere 6.1.x
VMware NSX for vSphere 6.2.x
VMware NSX for vSphere 6.3.x

Cause

This issue occurs when an Edge virtual machine fails to initialize after being redeployed.

In addition, RPC timeout messages may be seen when the NSX Manager and Edge cannot communicate. Such communication occurs through the VIX channel if the Edge resides on a vSphere ESXi host which has not been prepared for NSX. If the ESXi host has been prepared, the communication occurs through the message bus channel.
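
As a quick check of whether the NSX Manager can still reach a given Edge, you can query the Edge status API. The sketch below is a minimal example with curl; the credentials and the edge ID edge-18 are placeholders for your own values:

    # Retrieve the overall status of a specific NSX Edge (edge-18 is an example ID)
    curl -k -u 'admin:password' \
        "https://NSX_Manager_IP/api/4.0/edges/edge-18/status"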

Resolution

Validate that each troubleshooting step below is true for your environment. Each step provides instructions or a link to a document to help eliminate possible causes and take corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Do not skip a step.
  1. Check message bus status using the API https://NSX_Manager_IP/api/2.0/nwfabric/status?resource=MOID_OF_CLUSTER.

    Note: A working message bus will return a green status.
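
    For example, with curl (the credentials and the cluster managed object ID domain-c7 below are placeholders for your own values):

    # Check host preparation and message bus status for a cluster
    curl -k -u 'admin:password' \
        "https://NSX_Manager_IP/api/2.0/nwfabric/status?resource=domain-c7"

    In the XML response, look for the messaging infrastructure feature and confirm that its status is green.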

  2. Run this command on the NSX Edge to determine if the message bus is enabled and the RabbitMQ channels are listening:

    show messagebus messages

    You see output similar to:

    Message bus is enabled
    cmd conn state : listening
    init_req : 14
    init_resp : 14
    init_req_err : 0
    init_resp_err : 0
    pwchg_req : 1
    pwchg_resp : 1
    pwchg_resp_ok : 1
    pwchg_resp_fail: 0
    pwchg_updated : 1
    pwchg_req_err : 0
    pwchg_resp_err : 0
    pwchg_resp_miss: 0
    cert_change : 0
    cmd_req : 362
    cmd_resp : 361
    cmd_invalid : 0
    cmd_req_err : 0
    cmd_req_abort : 13
    cmd_resp_err : 0
    em_req : 361
    em_resp : 360
    em_req_err : 0
    em_resp_invalid: 0
    em_resp_timeout: 0
    em_resp_err : 0
    hb : 73743
    hb_rx_err : 0
    hb_ack_err : 0
    cmd_ch_conn : 59
    cmd_login_fail : 0
    msg_thr_rstart : 45
    -----------------------
    evt conn state : listening
    vse_rx : 223719
    vse_rx_hc : 223717
    vse_rx_evt : 2
    vse_rx_msg : 22347
    vse_rx_hc_empty: 0
    vse_rx_err : 0
    vse_tx_hc : 223727
    vse_tx_evt : 2
    vse_tx_hc_err : 14
    vse_tx_evt_err : 0
    evt_rsp : 2
    evt_rsp_no_file: 0
    evt_rsp_more : 0
    evt_rsp_push : 0
    evt_ch_conn : 23
    evt_login_fail : 0
    vse_thr_rstart : 0
    -----------------------
    cli_rx : 2
    cli_tx : 2
    cli_tx_err : 0
    cli_thr_rstart : 0
    counters_reset : 0


  3. Run this command on the NSX Edge to determine if the VMCI channels to the vSphere ESXi host are up:

    show messagebus forwarder

    You see output similar to:

    Forwarder Command Channel
    vmci_conn : up
    app_client_conn : up
    vmci_rx : 74427
    vmci_tx : 74446
    vmci_rx_err : 0
    vmci_tx_err : 0
    vmci_closed_by_peer: 58
    vmci_tx_no_socket : 0
    app_rx : 74446
    app_tx : 74427
    app_rx_err : 0
    app_tx_err : 0
    app_conn_req : 59
    app_closed_by_peer : 0
    app_tx_no_socket : 0
    -----------------------
    Forwarder Event Channel
    vmci_conn : up
    app_client_conn : up
    vmci_rx : 22494
    vmci_tx : 224001
    vmci_rx_err : 0
    vmci_tx_err : 0
    vmci_closed_by_peer: 22
    vmci_tx_no_socket : 0
    app_rx : 224001
    app_tx : 22494
    app_rx_err : 0
    app_tx_err : 0
    app_conn_req : 23
    app_closed_by_peer : 0
    app_tx_no_socket : 0
    -----------------------
    cli_rx : 2
    cli_tx : 2
    cli_tx_err : 0
    counters_reset : 0


    The vmci_closed_by_peer counter records the number of times that the connection has been closed by the host agent. An incrementing value and vmci_conn: down status indicate that the host agent cannot connect to the RMQ broker. To validate this step further, run the show log follow command and search for messages similar to VmciProxy: [daemon.debug] VMCI Socket is closed by peer.
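
    The same VMCI socket events can usually be confirmed from the host side as well, because the vsfwd process on the ESXi host proxies the VMCI traffic and logs to /var/log/vsfwd.log (the same file referenced in the next step):

    # Look for recent VMCI-related messages in the host's message bus proxy log
    grep -i "vmci" /var/log/vsfwd.log | tail -n 20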

  4. To check the health of the connections from the host side, use the esxcli network ip connection list | grep 5671 command.

    ~ # esxcli network ip connection list | grep 5671
    tcp 0 0 10.32.43.4:43329 10.32.43.230:5671 ESTABLISHED 35854 newreno vsfwd
    tcp 0 0 10.32.43.4:52667 10.32.43.230:5671 ESTABLISHED 35854 newreno vsfwd
    tcp 0 0 10.32.43.4:20808 10.32.43.230:5671 ESTABLISHED 35847 newreno vsfwd
    tcp 0 0 10.32.43.4:12486 10.32.43.230:5671 ESTABLISHED 35847 newreno vsfwd


    If the output does not show the connections as ESTABLISHED, collect the /var/log/vsfwd.log file and open a support request. For more information, see How to file a Support Request in Customer Connect (2006985).
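
    Before opening the support request, a quick scan of that log for recent errors can help narrow down the failure; the filter below is only a starting point:

    # Show recent error/failure entries from the host's message bus proxy log
    tail -n 200 /var/log/vsfwd.log | grep -iE "error|fail"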

    Note: NSX for vSphere release 6.1.5 resolves known publishing timeout issues by aggregating publishing jobs to improve performance. For more information, see the NSX for vSphere 6.1.5 Release Notes.


Additional Information

How to file a Support Request in Customer Connect
Collecting diagnostic information for VMware NSX for vSphere 6.x