Edge Load Balancer service status is UNKNOWN or NO_STANDBY

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
The following symptoms are observed:

Load Balancer operational status is UNKNOWN or NO_STANDBY.
On the Active or Standby Edge, the load balancer service has crashed and created a core file, for example:

/var/log/core/core.nginx.1550358639.gz

Syslog contains an error similar to the following:

/var/log/syslog
2019-03-03T19:47:31.740097+00:00 hostname kernel - - - [370535.138924] grsec: Invalid alignment/Bus error occurred at 00006cfd318d6000 in /opt/vmware/nsx-edge/bin/nginx[nginx:26545] uid/euid:134/134 gid/egid:140/140, parent /opt/vmware/nsx-edge/bin/nsd[nsd:4807] uid/euid:0/0 gid/egid:149/149

"lb-dispatcher" process may not be running

# pidof lb-dispatcher
(returns no output)
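
The on-disk symptoms above can be checked with a short script. This is a minimal sketch, not a VMware-provided tool: it assumes the core directory and syslog path quoted in this article, and the variable and function names (CORE_DIR, SYSLOG, count_nginx_cores, count_bus_errors) are illustrative only.

```shell
#!/bin/sh
# Sketch: look for the two on-disk symptoms described above.
# CORE_DIR and SYSLOG default to the paths quoted in this article;
# the variable and function names are illustrative, not part of NSX-T.
CORE_DIR="${CORE_DIR:-/var/log/core}"
SYSLOG="${SYSLOG:-/var/log/syslog}"

# Count nginx core files (named core.nginx.<timestamp>.gz) in a directory.
count_nginx_cores() {
    find "$1" -maxdepth 1 -name 'core.nginx.*' 2>/dev/null | wc -l | tr -d ' '
}

# Count log lines carrying the grsec bus-error message for nginx.
count_bus_errors() {
    grep -c 'grsec: Invalid alignment/Bus error.*nginx' "$1" 2>/dev/null || true
}

echo "nginx core files: $(count_nginx_cores "$CORE_DIR")"
echo "grsec bus errors: $(count_bus_errors "$SYSLOG")"
```

A non-zero count for either line suggests the Edge has hit the crash described here. Rotated syslog files are not searched by this sketch.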

Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 2.x

Cause

This issue is observed when there are a large number of load balancer reconfiguration tasks. Over time, a memory leak exhausts the load balancer memory and the service crashes.

The "lb-dispatcher" service may not be restarted after the crash. This happens when the service hangs and the system's attempt to stop it with SIGTERM times out. Systemd then kills the process, marks the unit as failed, and does not try to restart it, as the following log entries show:

<28>1 2019-04-21T23:22:16.353836+00:00 rtp-NP-nsxtedge10 systemd 1 - - nsx-edge-dispatcher.service: State 'stop-sigterm' timed out. Killing.
<29>1 2019-04-21T23:22:18.323229+00:00 rtp-NP-nsxtedge10 systemd 1 - - nsx-edge-dispatcher.service: Main process exited, code=killed, status=9/KILL
<30>1 2019-04-21T23:22:18.348032+00:00 rtp-NP-nsxtedge10 systemd 1 - - Stopped Edge LB Dispatcher.
<29>1 2019-04-21T23:22:18.348352+00:00 rtp-NP-nsxtedge10 systemd 1 - - nsx-edge-dispatcher.service: Unit entered failed state.
<28>1 2019-04-21T23:22:18.348721+00:00 rtp-NP-nsxtedge10 systemd 1 - - nsx-edge-dispatcher.service: Failed with result 'signal'.
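
To confirm whether the dispatcher unit is in the failed state shown in these log entries, the unit can be queried with `systemctl is-failed`. The following is a minimal sketch: it assumes the unit name nsx-edge-dispatcher.service seen in the log above, and classify_state is a hypothetical helper that only interprets the state string.

```shell
#!/bin/sh
# Sketch: report whether the dispatcher unit is stuck in the failed state.
# The unit name is taken from the log lines above; classify_state is an
# illustrative helper, not part of NSX-T or systemd.
UNIT="${UNIT:-nsx-edge-dispatcher.service}"

# Map a systemd unit state string to a short diagnosis.
classify_state() {
    case "$1" in
        failed) echo "failed" ;;      # matches the failure described above
        active) echo "running" ;;
        "")     echo "unknown" ;;     # systemctl missing or query failed
        *)      echo "other:$1" ;;
    esac
}

# 'systemctl is-failed' prints the current unit state; its exit code is
# non-zero when the unit is not failed, so the exit code is ignored here.
state="$(systemctl is-failed "$UNIT" 2>/dev/null || true)"
echo "$UNIT: $(classify_state "$state")"
```

A result of "failed" corresponds to the "Unit entered failed state" message in the log.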

Resolution

This issue is resolved in VMware NSX-T Data Center 2.4, available at VMware Downloads.

Workaround:
The Edge VM can be rebooted to work around the issue. Because the issue can recur before the upgrade to NSX-T 2.4, the recommendation is to add a cron job on each LB Edge node that periodically kills the lb-dispatcher process, so that the memory leak does not exhaust memory again. If the Edge has already hit the memory leak, killing only lb-dispatcher does not help.

Steps:

1. Check whether the "lb-dispatcher" process is running:
# pidof lb-dispatcher

2. Collect the Edge log bundle (including core dump files), then clean up the core files under /var/log/core.

3a. If "lb-dispatcher" is running, kill the lb-dispatcher PID:
# kill `pidof lb-dispatcher`

3b. If "lb-dispatcher" is NOT running, reboot the Edge.

4. Create a cron job that kills the lb-dispatcher process every night until you upgrade to 2.4.

On each Edge VM, edit /etc/crontab and add:
0 1 * * * root kill `pidof lb-dispatcher` > /dev/null 2>&1
(This cron example restarts the LB service every night at 1 a.m., which resets its memory usage and prevents the service from crashing during business hours.)
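
The decision between steps 3a and 3b can be sketched as a small dry-run script. It only reports which step applies and does not kill or reboot anything. It assumes the process name used in this article, and choose_step is a hypothetical helper introduced for illustration.

```shell
#!/bin/sh
# Dry-run sketch of steps 1 and 3: report the step to follow, do not act.
# PROCESS_NAME defaults to the name used in this article; choose_step is
# an illustrative helper, not an NSX-T command.
PROCESS_NAME="${PROCESS_NAME:-lb-dispatcher}"

# Map the output of pidof (a PID list, or empty) to the step to follow.
choose_step() {
    if [ -n "$1" ]; then
        echo "3a"
    else
        echo "3b"
    fi
}

# pidof prints nothing and exits non-zero when the process is not running.
pids="$(pidof "$PROCESS_NAME" 2>/dev/null || true)"

case "$(choose_step "$pids")" in
    3a) echo "$PROCESS_NAME is running (PID $pids): step 3a, run: kill $pids" ;;
    3b) echo "$PROCESS_NAME is not running: step 3b, reboot the Edge" ;;
esac
```

Collect the log bundle (step 2) before acting on either result, because a reboot or cleanup can remove the evidence needed for analysis.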