ESXi hosts become unresponsive when vSAN cluster health commands are run on multiple hosts simultaneously

Products

VMware vSAN

Issue/Introduction

Troubleshooting hosts in a vSAN cluster that become non-responsive for no apparent reason.
Troubleshooting the vSAN Network Latency alert in Skyline Health.

Symptoms:

When one of the following commands is run on multiple hosts in the same cluster simultaneously, or in close succession, the hosts may become unresponsive:
- esxcli vsan health cluster list
- esxcli vsan health cluster get

The vmkernel.log shows alerts similar to:

2018-10-17T08:25:01.798Z cpu3:17834932)ALERT: hostd detected to be non-responsive

The /var/log/vsansystem.log shows a timeout while notifying hostd (note the opID):

2018-09-18T07:10:43.879Z error vsansystem[60FC26E700] [Originator@6876 sub=VsanSystemProvider opID=CMMDSAccessUpdate-607c] Unexpected exception when send hostd notification! Tag: vsanRuntimeInfo, E: N7Vmacore16TimeoutExceptionE(Operation timed out)

The /var/log/hostd.log shows entries similar to:

2018-09-23T13:00:28.493Z warning hostd[34481B70] [Originator@6876 sub=Default] Failed to accept connection; <acceptor p:0x33d1fdf8, h:21, <TCP '127.0.0.1:8307'>>, e: system:24(Too many open files)

2018-09-17T07:34:47.398Z info hostd[34585B70] [Originator@6876 sub=VsanSimsStubImpl opID=875c6c9f user=dcui:vsanmgmtd] Calling vim.host.VsanSystem.queryHostStatus as task
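As a quick check for the log signatures above, the three logs can be searched for the quoted messages. The following is a minimal sketch, not a VMware-provided tool: the function name scan_for_symptoms and the LOG_DIR variable are illustrative assumptions, and the match strings are copied from the sample entries shown above.

```shell
#!/bin/sh
# Hypothetical helper: LOG_DIR and scan_for_symptoms are illustrative names.
LOG_DIR="${LOG_DIR:-/var/log}"

scan_for_symptoms() {
    found=0
    # Each input line is "<log file>|<fixed string to look for>".
    while IFS='|' read -r file pattern; do
        # -F matches the text literally; a missing log file is skipped.
        if grep -q -F -e "$pattern" "$LOG_DIR/$file" 2>/dev/null; then
            echo "$file: found '$pattern'"
            found=1
        fi
    done <<EOF
vmkernel.log|hostd detected to be non-responsive
vsansystem.log|Unexpected exception when send hostd notification
hostd.log|Too many open files
EOF
    # Exit status 0 if at least one signature was found, 1 otherwise.
    [ "$found" -eq 1 ]
}
```

Calling scan_for_symptoms on an affected host prints one line per matching log. A match only suggests this issue; the same messages can appear for other causes of resource exhaustion.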

The following traces indicate that the "esxcli vsan health cluster list" command was executed on multiple hosts at the same time. The thread pool on some hosts is exhausted while hostd on those hosts invokes APIs (such as QueryHostStatus) on vsanmgmtd. These API calls block in vsanmgmtd waiting for worker threads, while the health check operation itself has outstanding requests to hostd. The result is a deadlock on all hosts, which makes the hosts in the vSAN cluster unresponsive.

209406 2018-09-17T07:34:50Z VSANMGMTSVC: WARNING vsanperfsvc[269000d4-ba4c-11e8] [VsanHealthUtil::log] impl._QueryVerifyNetworkSettings: 4.34s

......

209575 2018-09-17T07:34:51Z VSANMGMTSVC: WARNING vsanperfsvc[27d94f6e-ba4c-11e8] [VsanHealthUtil::log] impl._QueryVerifyNetworkSettings: 2.82s

......

210705 2018-09-17T07:44:43Z VSANMGMTSVC: ERROR vsanperfsvc[23ae4ffe-ba4c-11e8] [VsanVcClusterHealthSystemImpl::_QueryClusterHealthSummary] Time out in executing QueryClusterHealthSummary

This may also impact the vSAN Network Latency test and lead to the vSAN Network Latency warning being triggered.

Environment

VMware vSAN 6.x

Cause

From a resource consumption perspective, “esxcli vsan health cluster get” performs the same full health check as “esxcli vsan health cluster list”.
The ‘esxcli vsan health cluster’ commands are implemented as cluster-level operations and return the same output on every host in the cluster.
The host running the command acts in the role of vCenter and connects to all the hosts in the cluster to collect all vSAN health data.
If multiple hosts in one cluster run the command at the same time, the available vSAN worker thread pool is exhausted and hostd hangs.

Resolution

DO NOT run these commands on multiple hosts in the same cluster simultaneously.
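One way to enforce this is to run the health check only from a single admin workstation and serialize it with a lock. The following is a minimal sketch under that assumption: the function name run_cluster_health, the LOCK_DIR path, the RUN_CMD variable and the root@ SSH login are all illustrative, not part of any VMware tooling. Because the output is cluster-wide, querying one host is sufficient.

```shell
#!/bin/sh
# Hypothetical wrapper, intended for one admin workstation (not the ESXi hosts).
# LOCK_DIR and RUN_CMD are illustrative assumptions.
LOCK_DIR="${LOCK_DIR:-/tmp/vsan-health-check.lock}"
RUN_CMD="${RUN_CMD:-ssh}"

run_cluster_health() {
    host="$1"
    # mkdir is atomic: only one caller at a time can create the directory.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "another cluster health check is in progress; skipping $host" >&2
        return 1
    fi
    rc=0
    # Query exactly one host; the command reports on the whole cluster.
    $RUN_CMD "root@$host" esxcli vsan health cluster list || rc=$?
    rmdir "$LOCK_DIR"
    return "$rc"
}
```

The lock only protects callers that go through this wrapper on the same workstation. It does not stop someone from running the command directly on another host.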

Workaround:

Try restarting the "vsanmgmtd" service to free up the resources. This is not guaranteed to resolve the issue if the resources are being exhausted by other factors. The following init script, shown with its available options, controls the vsanmgmtd service:
- /etc/init.d/vsanmgmtd {start|stop|restart|status}
If the host is still not responsive in vCenter, restart the hostd service:
- /etc/init.d/hostd {start|stop|restart|status}
If restarting these services does not fix the issue, reboot the host to restart all services.
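The escalation order above can be sketched as a small shell function. This is an illustration only: the function name restart_vsan_mgmt_services, the INIT_DIR variable and the "with-hostd" argument are assumptions introduced here. Only the two init scripts and their restart and status options come from this article.

```shell
#!/bin/sh
# Hypothetical helper for the workaround; INIT_DIR is overridable for testing.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

restart_vsan_mgmt_services() {
    # Step 1: restart vsanmgmtd to free up the worker threads.
    "$INIT_DIR/vsanmgmtd" restart || return 1
    "$INIT_DIR/vsanmgmtd" status

    # Step 2: only when asked, because the host is still not responsive
    # in vCenter, restart hostd as well.
    if [ "${1:-}" = "with-hostd" ]; then
        "$INIT_DIR/hostd" restart || return 1
        "$INIT_DIR/hostd" status
    fi
}
```

Run restart_vsan_mgmt_services first, check the host in vCenter, and call restart_vsan_mgmt_services with-hostd only if the host is still not responding. A reboot remains the last resort.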