Describe symptoms of a rare issue.
Provisioning of new desktops from Horizon onto VMC on AWS vSphere clusters begins to fail, with statuses such as "Error ..." and "Provisioning Error (missing)". The sequence of events is as follows:
1. Horizon starts a set of provisioning tasks, which vCenter (VC) receives and begins to act on. Each task is assigned an ID that Horizon is aware of.
2. While these tasks are executing, a change is made to the host-level object in VC that sets the permissions of the Horizon account to "No access". This change is part of the VMC ESXi host rolling reboot process for firmware updates.
journal log on VC
<MONTH> <TIME> vcenter.sddc-x-x-x-x.vmwarevmc.com vpxd[35070]: Event [53229] [1-1] [<DATE>T<TIME>Z] [vim.event.PermissionUpdatedEvent] [info] [VMC.LOCAL\mobla] [SDDC-Datacenter] [53229] [Permission changed for 'VMC.LOCAL\CloudAdminGroup' on '172.x.x.x'. Role changed from 'CloudAdmin' to role 'No access'. Propagate changed from 'Enabled' to 'Enabled'.]
3. Later, Horizon sends a call to VC to update the list of objects (likely tasks) in its view. The updated list includes an object or task on which the Horizon account no longer has permissions.
vpxd.log
<DATE>T<TIME>Z verbose vpxd[41541] [Originator@6876 sub=Default opID=<ID>] [VpxVmomi] Invoking; <<<ID>, <TCP '<IP> : 8085'>, <TCP '<IP> : 60334'>>, session[<ID>]<ID>, vim.view.ListView.modify>
4. VC denies the vim.view.ListView.modify call with vim.fault.NoPermission. Further issues then cascade because the listener thread in Horizon is killed.
vpxd.log
<DATE>T<TIME>Z verbose vpxd[41541] [Originator@6876 sub=Vmomi opID=<ID>] Invoke error; <<<ID>, <TCP '<IP>: 8085'>, <TCP '<IP> : 60334'>>, session[<ID>]<ID>, vim.view.ListView.modify> Throw: vim.fault.NoPermission
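The two log signatures above can be searched for in exported logs to confirm the issue. The following is a minimal sketch, not an official tool: the regular expressions are derived only from the excerpts shown in this article, the function name scan is hypothetical, and real log lines may differ in detail between vCenter builds.

```python
import re

# Permission-change event, modeled on the journal log excerpt in this article.
PERMISSION_EVENT = re.compile(
    r"\[vim\.event\.PermissionUpdatedEvent\].*"
    r"Permission changed for '(?P<principal>[^']+)' on '(?P<entity>[^']+)'\. "
    r"Role changed from '(?P<old_role>[^']+)' to role '(?P<new_role>[^']+)'"
)

# Denied ListView.modify call, modeled on the vpxd.log excerpt in this article.
DENIED_MODIFY = re.compile(
    r"Invoke error;.*vim\.view\.ListView\.modify>\s+Throw:\s+vim\.fault\.NoPermission"
)


def scan(lines):
    """Return (permission_changes, denied_calls) found in an iterable of log lines.

    permission_changes holds (principal, entity, old_role) tuples for events
    where the new role is 'No access'; denied_calls holds the matching lines.
    """
    changes, denied = [], []
    for line in lines:
        match = PERMISSION_EVENT.search(line)
        if match and match.group("new_role").lower() == "no access":
            changes.append(
                (match.group("principal"), match.group("entity"), match.group("old_role"))
            )
        elif DENIED_MODIFY.search(line):
            denied.append(line.rstrip("\n"))
    return changes, denied
```

To use it, pass an open log file (or any list of lines) to scan. A "No access" permission change on a host followed by denied ListView.modify calls is consistent with the sequence described above, although it does not by itself prove that the Horizon tracker process has exited.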
The underlying root cause is the VMC on AWS rolling reboot maintenance for ESXi firmware updates. During this scripted automation task, CloudAdmin access on the ESXi host is currently changed to "No access" while the host is removed for maintenance. Meanwhile, Horizon runs a process that uses the CloudAdmin user to query the host for data about objects that Horizon is tracking. This results in a retry loop that continues until the running threads of Horizon's tracker process are consumed, at which point the process exits completely (as of Horizon v7.13.1).
The immediate remediation is to restart the affected Horizon Connection Server.
The impact on Horizon in this scenario is a failure to track objects and a failure to dynamically provision new desktops to a Horizon pool.