vRealize Automation 8.x system nodes are stuck on "starting docker engine" after a restart

Article ID: 318640

Updated On:

Products

VMware Aria Suite

Issue/Introduction

Symptoms:
  • After a vRealize Automation 8.x system has been running for a long time (a few months), the appliance gets stuck on "starting docker engine" when a node is restarted.
  • This step may take up to a few hours, but it eventually completes successfully.
  • During this time, the user can log in to the system via SSH, run "journalctl -u docker -f", and see messages like:
<MONTH> <DATE> HH:MM:SS <HOSTNAME> dockerd[679]: time="YYYY-MM-DDTHH:MM:SS.970559016Z" level=info msg="Removing stale sandbox badfb5b205f959b40f7eb587106b9a8d62f86393876ab18783926e0b116700d5 (bb731fa1ed1a5cc552487a69bbf2cefab536d252877997bd2e212c7f2b2b467e)"
<MONTH> <DATE> HH:MM:SS <HOSTNAME> dockerd[679]: time="YYYY-MM-DDTHH:MM:SS.128605700Z" level=info msg="Removing stale sandbox 2bba66f9d6eef514c2815622c4f10a7c538f6a344c596740622ae7f4cee1c5f1 (e24cd80099d57af42d82d00a95b72f177d44f8d76482265095b060fee65cb14c)"
  • The same problem may be observed during an upgrade as well. The system may remain for a few hours on the step "Deactivating cluster of appliance nodes. This might take several minutes.". In the meantime, LCM might time out, while the actual upgrade of vRealize Automation 8.x will eventually succeed.
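
While a node is stuck, counting the stale-sandbox messages gives a rough sense of how much cleanup Docker has already done. The following is a minimal sketch, assuming the log lines have the format shown above; the function name count_stale_sandbox is hypothetical and not part of the product.

```shell
# Hypothetical helper: count "Removing stale sandbox" lines read from stdin.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the exit
# status clean while still printing 0.
count_stale_sandbox() {
    grep -c 'Removing stale sandbox' || true
}

# Usage on the appliance (-b limits the log to the current boot,
# --no-pager prints it without an interactive pager):
#   journalctl -u docker -b --no-pager | count_stale_sandbox
```

Running it twice a few minutes apart shows whether the count is still growing, which suggests the cleanup is still in progress rather than hung.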


Environment

VMware vRealize Automation 8.x
VMware vRealize Automation 8.1.x

Cause

  • While services operate continuously, they may accumulate a large amount of content in their ephemeral storage. When services are power-cycled, docker and container do not remove all of this content, so that these operations are not delayed.
  • The actual removal happens for the whole storage once the docker service is restarted. Docker runs through all the content to determine which part of it is ephemeral. This might take a few hours.
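
To get a rough idea of how much data Docker may have to walk through on the next restart, the size of its data directory can be checked beforehand. This is only a sketch: it assumes the standard docker info and du commands are available on the appliance, the function name dir_size_kb is hypothetical, and the article does not state which directory holds the ephemeral content.

```shell
# Hypothetical helper: print the disk usage of a directory in kilobytes.
dir_size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Usage on the appliance (assumption: the Docker data root is the relevant location):
#   dir_size_kb "$(docker info --format '{{.DockerRootDir}}')"
```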

Resolution

To resolve this issue, upgrade to vRealize Automation 8.1 Patch 1 or 8.0.1 Patch 5, available from VMware Downloads.

Workaround:
To work around the issue, install the patch mentioned below. This patch addresses the problem for vRealize Automation 8.0.1 GA/Patch 1 - 5 and vRealize Automation 8.1 GA. Once the patch is applied, the next time you reboot or start an upgrade, the docker storage will be cleaned in one batch and the images will be re-imported from the backup. This takes a roughly constant amount of time, ranging from 4 to 10 minutes depending on the available IOPS of the hosting infrastructure.

Use the following steps:
  1. Verify that all services are started and healthy by running:
kubectl -n prelude get pods
  • All services should be in 'Running' or 'Completed' status.
  • DO NOT proceed with step 3 if this is not confirmed.
  2. Without powering off the nodes, take VM snapshots (without memory) of all vRealize Automation nodes in quick succession.
  3. On each vRealize Automation node, execute the following steps:
    • Upload 3943-patch-docker-service.tar.gz to /root
    • Patch the node by running from /root:
tar -xvf 3943-patch-docker-service.tar.gz && chmod a+x 3943-patch-docker-service.sh && ./3943-patch-docker-service.sh && rm 3943-patch-docker-service.*
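
The health check in step 1 can be scripted so that patching is not started by mistake. The following is a minimal sketch, assuming the default kubectl table layout in which the third column is STATUS; the function name unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print the name and status of every pod that is neither Running nor Completed.
# Lines with fewer than three fields (for example blank lines) are ignored.
unhealthy_pods() {
    awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage on a vRealize Automation node:
#   bad=$(kubectl -n prelude get pods --no-headers | unhealthy_pods)
#   if [ -z "$bad" ]; then echo "All pods healthy"; else printf 'Do not patch yet:\n%s\n' "$bad"; fi
```

An empty result corresponds to the condition in step 1; any printed pod means the snapshot and patch steps should wait.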
 


Attachments

3943-patch-docker-service.tar