NSX-T UI alarms are generated: Application on NSX node <node> has crashed.

Products

VMware NSX Networking

Issue/Introduction

Symptoms:

The following alarm appears in the NSX-T UI:

Application on NSX node <node> has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team. Recommended Action: Collect Support Bundle for NSX node <nsx manager> using NSX Manager UI or API.

In /var/log/syslog.log on the NSX-T appliance node (Unified Appliance, Edge, etc.), messages similar to the following appear:

2023-05-19T02:50:34.898Z local-manager NSX 85581 MONITORING [nsx@6876 alarmId="e44e47ae-8c4c-47aa-85a9-7a159b72d7ee" alarmState="OPEN" comp="nsx-manager" entId="340cd33e-fec7-46cd-91d5-ff3b6fc90faf" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="application_crashed" level="FATAL" nodeId="d1be0142-b001-01f5-8bdb-d5ae7b37180b" subcomp="monitoring"] Application on NSX node local-manager has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.

If the node is an ESXi host transport node, similar messages appear in /var/log/nsx-syslog.log:

2023-05-18T10:07:31Z nsx-sha: NSX 268653 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="CRITICAL" eventFeatureName="infrastructure_service" eventType="application_crashed" eventSev="critical" eventState="On" entId="76a85727-30ab-4ff5-bb7c-a064668252f0"] Application on NSX node sc2-10-185-106-230.esxi.host.com has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
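
To find these events in a large log file, the key="value" fields inside the [nsx@6876 ...] block can be parsed and filtered on eventType. The following is a minimal Python sketch, not a VMware tool; the function name parse_crash_alarm is hypothetical, and it assumes the log lines follow the layout of the two examples above.

```python
import re

# Captures the contents of the [nsx@6876 ...] block and the free-text message after it.
SD_RE = re.compile(r'\[nsx@6876 ([^\]]*)\]\s*(.*)')
# Captures each key="value" pair inside that block.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_crash_alarm(line):
    """Return the fields of an application_crashed event, or None for any other line.

    Hypothetical helper: assumes the line layout shown in the log excerpts above.
    """
    match = SD_RE.search(line)
    if match is None:
        return None
    fields = dict(FIELD_RE.findall(match.group(1)))
    if fields.get("eventType") != "application_crashed":
        return None
    fields["message"] = match.group(2).strip()
    return fields


def find_crash_alarms(lines):
    """Yield parsed fields for every application_crashed event in an iterable of lines."""
    for line in lines:
        fields = parse_crash_alarm(line)
        if fields is not None:
            yield fields
```

For example, passing the lines of a copied syslog file to find_crash_alarms yields one dictionary per crash event, from which fields such as comp, subcomp, and entId can be read.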

Environment

VMware NSX-T Data Center

Cause

One or more services have crashed and the system generated a core dump file for each. All NSX services are configured to restart automatically after a crash. Depending on which application crashed, other services that depend on it may not be functioning correctly. Verify the status of each crashed service to confirm that it is running again.

On the NSX-T appliance node, service status can be verified in nsxcli as below:

nsxcli> get service <service-name>
or
nsxcli> get services

The application crash should have generated a core or heap dump on the NSX node, which can be verified in nsxcli as below:

nsxcli> get core-dumps
Directory: /var/core
20762624 May 18 2023 11:44:13 UTC nsx-exporter-zdump.000
26832896 May 18 2023 10:04:59 UTC opsAgent-zdump.000

Note: In the example output above, two services, nsx-exporter and opsAgent, crashed and the system generated their respective core dump files.
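
When the output of get core-dumps is saved from several nodes, it can be convenient to turn it into structured records. The following Python sketch is illustrative only; the function name parse_core_dumps is hypothetical, and it assumes every entry follows the "size, date, time, UTC, file name" layout of the example output above.

```python
import re
from datetime import datetime, timezone

# One entry per line: <size in bytes> <Mon DD YYYY HH:MM:SS> UTC <file name>
ENTRY_RE = re.compile(
    r'^\s*(\d+)\s+(\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) UTC\s+(\S+)\s*$'
)


def parse_core_dumps(output):
    """Parse saved `get core-dumps` output into a list of dicts.

    Hypothetical helper: lines that do not match the assumed layout
    (such as the "Directory:" header) are skipped.
    """
    dumps = []
    for line in output.splitlines():
        match = ENTRY_RE.match(line)
        if match is None:
            continue
        size, stamp, name = match.groups()
        created = datetime.strptime(stamp, "%b %d %Y %H:%M:%S")
        dumps.append({
            "name": name,
            "size_bytes": int(size),
            "created": created.replace(tzinfo=timezone.utc),
        })
    return dumps
```

Sorting the returned list by the created field shows which dump is the most recent, which helps when deciding which files to attach to a Support Request.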

Resolution

This is a known issue impacting VMware NSX.

VMware Engineering relies on customer reports of such issues to improve product quality.

The application crashed alarm indicates that an NSX service hit a fatal or unhandled exception, caused either by the service itself or by an environmental factor, which resulted in a core or heap dump being generated.

Report application crashes to the VMware Support team so that NSX services can be made more robust in future releases.

To report an application crash, follow the steps below:
1. Collect the latest support bundle, including core dump and audit logs, from each node where the application crashed alarm is observed. For details on collecting a support bundle with core and audit logs, see:
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.2/administration/GUID-73D9AF0D-4000-4EF2-AC66-6572AD1A0B30.html

On NSX appliances, individual core dump files can be copied to a remote location with the following nsxcli command:

nsxcli> copy core-dump core.nginx.1559278043.gz url scp://<user>@<remote-host>/tmp/
<user>@<remote-host>'s password:

If you encounter this issue, collect a support bundle and file a Support Request with VMware Support (see the KB article "How to file a Support Request in Customer Connect": https://kb.vmware.com/s/article/2006985).

Workaround:

After collecting the support-bundle, the application crashed alarm can be resolved by removing the core dump files from the respective nodes.

On NSX appliance nodes, use the nsxcli command that matches the NSX version to remove core and heap dump files:
For NSX version 4.1 or below:

nsxcli> del core-dump all
or
nsxcli> del core-dump <core-dump-file>

For NSX 4.1.1 or above:

Use the command below to collect a support bundle with core dump and audit logs while deleting the core dump files at the same time:
nsxcli> get support-bundle file support-bundle.tgz all remove-core-files

On ESXi host transport nodes, log in to the host as root and use the root shell or nsxcli command that matches the NSX version:

For NSX version 4.1 or below:

root> rm -rf /var/core

For NSX 4.1.1 or above:

nsxcli> del core-dump all
or
nsxcli> del core-dump <core-dump-file>
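
The version and node-type branching above can be summarized as a small lookup. The following Python sketch is illustrative only: the function cleanup_command and the node-type labels "appliance" and "esxi" are hypothetical names, the returned strings are the commands listed in this article, and treating every version lower than 4.1.1 as "4.1 or below" is an assumption.

```python
def parse_version(version):
    """Turn a dotted version string such as "4.1.0.2" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def cleanup_command(node_type, version):
    """Return the core dump cleanup command this article lists for the node.

    node_type is "appliance" or "esxi" (hypothetical labels).
    Assumption: any version lower than 4.1.1 counts as "4.1 or below".
    """
    is_older = parse_version(version) < (4, 1, 1)
    if node_type == "appliance":
        if is_older:
            return "nsxcli> del core-dump all"
        # On 4.1.1 or above, the bundle is collected and the core files are removed together.
        return "nsxcli> get support-bundle file support-bundle.tgz all remove-core-files"
    if node_type == "esxi":
        if is_older:
            return "root> rm -rf /var/core"
        return "nsxcli> del core-dump all"
    raise ValueError(f"unknown node type: {node_type!r}")
```

The function only returns the command text for an operator to review; it does not run anything on the node.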

Additional Information

The following KB articles cover known core dump issues:

https://kb.vmware.com/s/article/91712 - cfgagent core dump generated on ESXi host.
https://kb.vmware.com/s/article/93530 - core.nvpapi and core.sudo core dump generated on edge node.
https://kb.vmware.com/s/article/92163 - migration_oom.hprof core dump generated on NSX-T manager.
https://ikb.vmware.com/s/article/94418 - opsAgent core dump generated on ESXi host.
https://ikb.vmware.com/s/article/94529 - VDPI crash due to FQDN change in context firewall rule.