- Troubleshooting hosts in a vSAN cluster that become non-responsive for no apparent reason.
- Troubleshooting the vSAN Network Latency alert in Skyline Health.
Symptoms:
- When one of the following commands is run on multiple hosts in the same cluster simultaneously, or in close succession, the hosts may become unresponsive:
- esxcli vsan health cluster list
- esxcli vsan health cluster get
- An alert similar to the following appears in vmkernel.log:
2018-10-17T08:25:01.798Z cpu3:17834932)ALERT: hostd detected to be non-responsive
- /var/log/vsansystem.log shows the opID of the timed-out operation, and hostd logs entries similar to the following:
2018-09-18T07:10:43.879Z error vsansystem[60FC26E700] [Originator@6876 sub=VsanSystemProvider opID=CMMDSAccessUpdate-607c] Unexpected exception when send hostd notification! Tag: vsanRuntimeInfo, E: N7Vmacore16TimeoutExceptionE(Operation timed out)
2018-09-23T13:00:28.493Z warning hostd[34481B70] [Originator@6876 sub=Default] Failed to accept connection; <acceptor p:0x33d1fdf8, h:21, <TCP '127.0.0.1:8307'>>, e: system:24(Too many open files)
2018-09-17T07:34:47.398Z info hostd[34585B70] [Originator@6876 sub=VsanSimsStubImpl opID=875c6c9f user=dcui:vsanmgmtd] Calling vim.host.VsanSystem.queryHostStatus as task
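To check a host for these symptoms, the log files can be searched for the messages quoted above. The following is a minimal sketch in Python; the patterns are copied from the excerpts in this article, while the labels (`hostd_unresponsive` and so on) and the `scan_lines` helper are illustrative names, not part of any VMware tool. It assumes the logs have been copied to a machine where Python 3 is available.

```python
import re
import sys

# Patterns copied from the log excerpts in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "hostd_unresponsive": re.compile(
        r"ALERT: hostd detected to be non-responsive"),
    "vsansystem_notify_timeout": re.compile(
        r"Unexpected exception when send hostd notification!.*Operation timed out"),
    "hostd_too_many_open_files": re.compile(
        r"Failed to accept connection;.*Too many open files"),
}


def scan_lines(lines):
    """Return {label: [matching lines]} for every signature."""
    hits = {label: [] for label in SIGNATURES}
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[label].append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Usage: python scan_symptoms.py <log file> [<log file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for label, matched in scan_lines(handle).items():
                if matched:
                    print(f"{path}: {label}: {len(matched)} match(es)")
```

A host that reports matches for more than one label around the same timestamp is a likely candidate for the condition described in this article; a single match on its own is not conclusive.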
The following traces indicate that "esxcli vsan health cluster list" was executed on multiple hosts at the same time. The thread pool on some hosts is exhausted while hostd on those hosts invokes APIs such as QueryHostStatus against vsanmgmtd. These calls block in vsanmgmtd waiting for worker threads, while the health check operation itself has outstanding requests to hostd. The result is a deadlock on all hosts, which causes the hosts in the vSAN cluster to become unresponsive.
209406 2018-09-17T07:34:50Z VSANMGMTSVC: WARNING vsanperfsvc[269000d4-ba4c-11e8] [VsanHealthUtil::log] impl._QueryVerifyNetworkSettings: 4.34s
......
209575 2018-09-17T07:34:51Z VSANMGMTSVC: WARNING vsanperfsvc[27d94f6e-ba4c-11e8] [VsanHealthUtil::log] impl._QueryVerifyNetworkSettings: 2.82s
......
210705 2018-09-17T07:44:43Z VSANMGMTSVC: ERROR vsanperfsvc[23ae4ffe-ba4c-11e8] [VsanVcClusterHealthSystemImpl::_QueryClusterHealthSummary] Time out in executing QueryClusterHealthSummary
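Traces like the ones above can be summarized by extracting the slow health-check calls and the timed-out operations. The sketch below is one way to do that in Python. The regular expressions are derived only from the three trace lines shown in this article, so they may not cover every message variant; the `summarize` function and the 2-second threshold are assumptions chosen for illustration.

```python
import re

# "[VsanHealthUtil::log] <call>: <seconds>s" as seen in the traces above.
SLOW_CALL = re.compile(
    r"\[VsanHealthUtil::log\]\s+(?P<call>\S+):\s+(?P<seconds>\d+(?:\.\d+)?)s\s*$")
# "Time out in executing <operation>" as seen in the traces above.
TIMEOUT = re.compile(r"Time out in executing (?P<op>\w+)")


def summarize(lines, threshold_s=2.0):
    """Return (slow_calls, timeouts).

    slow_calls: list of (call name, seconds) at or above threshold_s
                (the threshold is an arbitrary illustrative value).
    timeouts:   list of operation names that reported a timeout.
    """
    slow_calls, timeouts = [], []
    for line in lines:
        match = SLOW_CALL.search(line)
        if match:
            seconds = float(match.group("seconds"))
            if seconds >= threshold_s:
                slow_calls.append((match.group("call"), seconds))
            continue
        match = TIMEOUT.search(line)
        if match:
            timeouts.append(match.group("op"))
    return slow_calls, timeouts
```

Several slow `impl._QueryVerifyNetworkSettings` entries followed by a `QueryClusterHealthSummary` timeout roughly ten minutes later matches the pattern in the traces shown here.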
- This can also affect the vSAN Network Latency test and trigger the vSAN Network Latency warning.