vCenter Server vpxd service crashes due to "Too many outstanding operations"

Article ID: 318208

Products

VMware vCenter Server

Issue/Introduction

Symptoms:

The vpxd service on vCenter Server crashes, producing a core.vpxd-worker file.
Messages in /var/log/vmware/vpxd/vpxd.log similar to the following immediately precede the crash:

2022-10-16T23:33:33.233-04:00 info vpxd[50262] [Originator@6876 sub=vpxLro opID=lro-76175637-7d814bb5] [VpxLRO] -- BEGIN lro-76175637 -- -- DrmExecute --
2022-10-16T23:33:33.233-04:00 info vpxd[50262] [Originator@6876 sub=drsExec opID=lro-76175637-7d814bb5] Executing DRS recommendation 1234 with 1 actions
2022-10-16T23:33:33.235-04:00 error vpxd[50262] [Originator@6876 sub=vpxLro opID=lro-76175637-7d814bb5] [VpxLRO] Unexpected Exception: N5Vmomi5Fault11SystemError9ExceptionE(Fault cause: vmodl.fault.SystemError
2022-10-16T23:33:33.236-04:00 info vpxd[50262] [Originator@6876 sub=Default opID=lro-76175637-7d814bb5] [VpxLRO] -- ERROR lro-76175637 -- -- DrmExecute: vmodl.fault.SystemError:
--> Result:
--> (vmodl.fault.SystemError) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>,
--> reason = "Too many outstanding operations"
--> msg = ""
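
To check whether a given vpxd.log contains this signature, the log can be scanned for the reason string and matched to the timestamped entry that precedes it. The following is a minimal sketch, not a supported tool. It assumes the log lines follow the layout of the excerpt above (a timestamped header line followed by "-->" continuation lines). The function name find_overflow_events and the regular expressions are illustrative assumptions.

```python
import re

# Assumption: header lines look like the excerpt above, for example
# "2022-10-16T23:33:33.236-04:00 info vpxd[50262] [... opID=lro-...] ..."
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2}T\S+)\s+\w+\s+vpxd\[\d+\]")
OPID = re.compile(r"opID=([^\]\s]+)")
REASON = 'reason = "Too many outstanding operations"'


def find_overflow_events(lines):
    """Return (timestamp, opID) pairs for each occurrence of the fault reason.

    The reason string appears on a continuation line without a timestamp,
    so the most recent header line is remembered and reported instead.
    """
    events = []
    last_ts = None
    last_opid = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            last_ts = header.group(1)
            opid = OPID.search(line)
            last_opid = opid.group(1) if opid else None
        elif REASON in line and last_ts is not None:
            events.append((last_ts, last_opid))
    return events
```

To use it, pass an open file object for a copy of /var/log/vmware/vpxd/vpxd.log (or any iterable of lines) and review the returned timestamps against the time of the crash.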

Environment

VMware vCenter Server 7.0.x

Cause

This issue occurs when a routine task (vScheduleCheckVsanConfigLro) starts while the vCenter Server LRO job queue is already full of queued tasks.

Resolution

This is resolved in vCenter Server 7.0 Update 3i (build number 20845200).

Workaround:
To work around this issue, use the vpxd log to identify and remedy the cause of the excessively queued LRO tasks.

The root cause of this issue is that the task causes vpxd to crash instead of handling the "Too many outstanding operations" fault properly. Even so, the large number of queued tasks on vCenter Server is worth investigating in its own right. The queued tasks might originate from an ESXi host that has partially stopped responding to commands sent by vCenter Server.

For example, if vpxd.log contains many vmodl.fault.HostCommunication messages for a specific host before the crash, that host might need to be placed in maintenance mode for further troubleshooting. It is also possible that a solution external to vCenter Server is sending requests faster than vCenter Server can process them, eventually filling the LRO job queue.
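
The host check described above can be approximated with a short script that counts vmodl.fault.HostCommunication lines per host. This is a hedged sketch only. It assumes the host name appears on the same log line as the fault string, which may not hold for every message. The function name count_host_faults and the caller-supplied host list are hypothetical and not part of any VMware tooling.

```python
import re
from collections import Counter

FAULT = "vmodl.fault.HostCommunication"


def count_host_faults(lines, host_names):
    """Count log lines that mention both the fault and a known host name.

    host_names is supplied by the caller (for example, from the vCenter
    inventory). Assumption: the host name is on the same line as the fault.
    """
    # Require a non-alphanumeric boundary so "esx1" does not match "esx10".
    patterns = {
        host: re.compile(
            r"(?<![A-Za-z0-9])" + re.escape(host) + r"(?![A-Za-z0-9])"
        )
        for host in host_names
    }
    counts = Counter()
    for line in lines:
        if FAULT not in line:
            continue
        for host, pattern in patterns.items():
            if pattern.search(line):
                counts[host] += 1
    return counts
```

Calling count_host_faults(log_file, hosts).most_common(5) on the portion of the log leading up to the crash lists the hosts with the most communication faults. A host that clearly dominates the list is a reasonable first candidate for further troubleshooting.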
