In VMware NSX for vSphere 6.x, the NSX Edge experiences high CPU utilization and/or fails to accept configuration changes

Article ID: 345647


Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • NSX Edge experiences high CPU utilization
  • NSX Edge fails to accept configuration changes
  • Running the show log command on the NSX Manager console reports entries similar to:

    2015-10-15 09:31:32.473 UTC ERROR TaskFrameworkExecutor-1 PublishUtils:92 - Timeout happened during execution of jobId 'jobdata-173581' for edgeId 'edge-18', startTime '1444899795922' currentTime '1444901492473': doingRollback 'false'
    2015-10-15 09:31:32.473 UTC ERROR TaskFrameworkExecutor-1 PublishTask:346 - Failed jobId 'jobdata-173581' for edge 'edge-18' during publishing.
    com.vmware.vshield.edge.exception.VshieldEdgeException:
    vShield Edge:10163:Publish Job jobdata-173581 for NSX Edge edge-18 timed out. It has already taken 28 minutes, hence was aborted and rollback has been performed.


  • Running the show log command on the NSX Manager console also reports entries similar to:

    2015-10-15 09:21:32.453 UTC INFO TaskFrameworkExecutor-1 AbstractEdgeApplianceManager:643 - The vse command is being sent to 'vm-31231' over msgBus
    2015-10-15 09:31:32.453 UTC INFO messagingTaskExecutor-10 QueueSubscriptionManager:252 - Purging queue 'vse_5031887e-aa71-ede0-7078-e9e1c2bf6b94_request_queue'. No wait = 'true'.
    2015-10-15 09:31:32.457 UTC INFO messagingTaskExecutor-10 VirtualMachineVcOperationsImpl:54 - Retrieving power-state for VM 'PRNESG003368-1'
    2015-10-15 09:31:32.462 UTC INFO messagingTaskExecutor-10 VirtualMachineVcOperationsImpl:57 - Power-state for VM 'PRNESG003368-1' = 'poweredOn'
    2015-10-15 09:31:32.462 UTC INFO messagingTaskExecutor-10 EdgeUtils:302 - SysEvent-Detailed-Message :(Kept only in logs) :: Rpc request to vm: vm-31231 timed out
    2015-10-15 09:31:32.466 UTC INFO messagingTaskExecutor-10 SystemEventDaoImpl:128 - [SystemEvent] Time:'Thu Oct 15 09:31:32.462 UTC 2015', Severity:'Major', Event Source:'vm-31231', Code:'30014', Event Message:'Failed to communicate with the NSX Edge VM.', Module:'NSX Edge Communication Agent'


    For more information, see Collecting diagnostic information for VMware NSX for vSphere 6.x (2074678).

    Note: The preceding log excerpts are only examples. Date, time, and environment-specific values vary from environment to environment.


Environment

VMware NSX for vSphere 6.1.x
VMware NSX for vSphere 6.2.x
VMware NSX for vSphere 6.3.x

Cause

This issue occurs when an Edge virtual machine fails to initialize after being redeployed.

In addition, RPC timeout messages may be seen when the NSX Manager and Edge cannot communicate. Such communication occurs through the VIX channel if the Edge resides on a vSphere ESXi host which has not been prepared for NSX. If the ESXi host has been prepared, the communication occurs through the message bus channel.
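
As a quick check of whether the NSX Manager can still reach a given Edge, you can query the Edge status API. The sketch below is a minimal example with curl; the credentials and the edge ID edge-18 are placeholders for your own values:

    # Retrieve the overall status of a specific NSX Edge (edge-18 is an example ID)
    curl -k -u 'admin:password' \
        "https://NSX_Manager_IP/api/4.0/edges/edge-18/status"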

Resolution

Validate that each troubleshooting step below is true for your environment. Each step provides instructions or a link to a document to help eliminate possible causes and take corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Do not skip a step.
  1. Check message bus status using the API https://NSX_Manager_IP/api/2.0/nwfabric/status?resource=MOID_OF_CLUSTER.

    Note: A working message bus will return a green status.
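
    For example, with curl (the credentials and the cluster managed object ID domain-c7 below are placeholders for your own values):

    # Check host preparation and message bus status for a cluster
    curl -k -u 'admin:password' \
        "https://NSX_Manager_IP/api/2.0/nwfabric/status?resource=domain-c7"

    In the XML response, look for the messaging infrastructure feature and confirm that its status is green.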

  2. Run this command on the NSX Edge to determine if the message bus is enabled and the RabbitMQ channels are listening:

    show messagebus messages

    You see output similar to:

    Message bus is enabled
    cmd conn state : listening
    init_req : 14
    init_resp : 14
    init_req_err : 0
    init_resp_err : 0
    pwchg_req : 1
    pwchg_resp : 1
    pwchg_resp_ok : 1
    pwchg_resp_fail: 0
    pwchg_updated : 1
    pwchg_req_err : 0
    pwchg_resp_err : 0
    pwchg_resp_miss: 0
    cert_change : 0
    cmd_req : 362
    cmd_resp : 361
    cmd_invalid : 0
    cmd_req_err : 0
    cmd_req_abort : 13
    cmd_resp_err : 0
    em_req : 361
    em_resp : 360
    em_req_err : 0
    em_resp_invalid: 0
    em_resp_timeout: 0
    em_resp_err : 0
    hb : 73743
    hb_rx_err : 0
    hb_ack_err : 0
    cmd_ch_conn : 59
    cmd_login_fail : 0
    msg_thr_rstart : 45
    -----------------------
    evt conn state : listening
    vse_rx : 223719
    vse_rx_hc : 223717
    vse_rx_evt : 2
    vse_rx_msg : 22347
    vse_rx_hc_empty: 0
    vse_rx_err : 0
    vse_tx_hc : 223727
    vse_tx_evt : 2
    vse_tx_hc_err : 14
    vse_tx_evt_err : 0
    evt_rsp : 2
    evt_rsp_no_file: 0
    evt_rsp_more : 0
    evt_rsp_push : 0
    evt_ch_conn : 23
    evt_login_fail : 0
    vse_thr_rstart : 0
    -----------------------
    cli_rx : 2
    cli_tx : 2
    cli_tx_err : 0
    cli_thr_rstart : 0
    counters_reset : 0


  3. Run this command on the NSX Edge to determine if the VMCI channels to the vSphere ESXi host are up:

    show messagebus forwarder

    You see output similar to:

    Forwarder Command Channel
    vmci_conn : up
    app_client_conn : up
    vmci_rx : 74427
    vmci_tx : 74446
    vmci_rx_err : 0
    vmci_tx_err : 0
    vmci_closed_by_peer: 58
    vmci_tx_no_socket : 0
    app_rx : 74446
    app_tx : 74427
    app_rx_err : 0
    app_tx_err : 0
    app_conn_req : 59
    app_closed_by_peer : 0
    app_tx_no_socket : 0
    -----------------------
    Forwarder Event Channel
    vmci_conn : up
    app_client_conn : up
    vmci_rx : 22494
    vmci_tx : 224001
    vmci_rx_err : 0
    vmci_tx_err : 0
    vmci_closed_by_peer: 22
    vmci_tx_no_socket : 0
    app_rx : 224001
    app_tx : 22494
    app_rx_err : 0
    app_tx_err : 0
    app_conn_req : 23
    app_closed_by_peer : 0
    app_tx_no_socket : 0
    -----------------------
    cli_rx : 2
    cli_tx : 2
    cli_tx_err : 0
    counters_reset : 0


    The vmci_closed_by_peer counter records the number of times that the connection has been closed by the host agent. An incrementing value and vmci_conn: down status indicate that the host agent cannot connect to the RMQ broker. To validate this step further, run the show log follow command and search for messages similar to VmciProxy: [daemon.debug] VMCI Socket is closed by peer.
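
    The same VMCI socket events can usually be confirmed from the host side as well, because the vsfwd process on the ESXi host proxies the VMCI traffic and logs to /var/log/vsfwd.log (the same file referenced in the next step):

    # Look for recent VMCI-related messages in the host's message bus proxy log
    grep -i "vmci" /var/log/vsfwd.log | tail -n 20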

  4. To check the health of the connections from the host side, use the esxcli network ip connection list | grep 5671 command.

    ~ # esxcli network ip connection list | grep 5671
    tcp 0 0 10.32.43.4:43329 10.32.43.230:5671 ESTABLISHED 35854 newreno vsfwd
    tcp 0 0 10.32.43.4:52667 10.32.43.230:5671 ESTABLISHED 35854 newreno vsfwd
    tcp 0 0 10.32.43.4:20808 10.32.43.230:5671 ESTABLISHED 35847 newreno vsfwd
    tcp 0 0 10.32.43.4:12486 10.32.43.230:5671 ESTABLISHED 35847 newreno vsfwd


    If the output does not show the connections as ESTABLISHED, collect the /var/log/vsfwd.log file and open a support request. For more information, see How to file a Support Request in Customer Connect (2006985).
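
    Before opening the support request, a quick scan of that log for recent errors can help narrow down the failure; the filter below is only a starting point:

    # Show recent error/failure entries from the host's message bus proxy log
    tail -n 200 /var/log/vsfwd.log | grep -iE "error|fail"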

    Note: NSX for vSphere release 6.1.5 resolves known publishing timeout issues by aggregating publishing jobs to improve performance. For more information, see the NSX for vSphere 6.1.5 Release Notes.


Additional Information

How to file a Support Request in Customer Connect
Collecting diagnostic information for VMware NSX for vSphere 6.x