LCM service crashing on SDDC Manager

Products

VMware Cloud Foundation

Issue/Introduction

Symptoms:

Similar errors may be observed in the SDDC Manager UI:

Similar errors are found in lcm.log:

/var/log/vmware/vcf/lcm/lcm.log
======================================================================================
2023-11-08T16:09:14.916+0000 INFO  [vcf_lcm,0000000000000000,0000] [c.v.e.s.l.s.i.BundleGraphServiceImpl,main] Verified bundle recall status:false for bundle:6dcb75f5-6fe8-463b-953e-9b0d92ee4116
2023-11-08T16:09:14.917+0000 ERROR [vcf_lcm,0000000000000000,0000] [c.v.e.s.l.s.i.BundleGraphServiceImpl,main] Failed to add edge in bundle graph for VCENTER.
2023-11-08T16:09:14.924+0000 INFO  [vcf_lcm,0000000000000000,0000] [o.a.coyote.http11.Http11NioProtocol,main] Pausing ProtocolHandler ["http-nio-127.0.0.1-7400"]
2023-11-08T16:09:14.924+0000 INFO  [vcf_lcm,0000000000000000,0000] [o.a.catalina.core.StandardService,main] Stopping service [Tomcat]
2023-11-08T16:09:14.938+0000 WARN  [vcf_lcm,0000000000000000,0000] [o.a.c.loader.WebappClassLoaderBase,main] The web application [ROOT] appears to have started a thread named [HikariPool-1 housekeeper] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.base/jdk.internal.misc.Unsafe.park(Native Method)
 java.base/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
 java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
 java.base/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182)
 java.base/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
 java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
 java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
 java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 java.base/java.lang.Thread.run(Thread.java:829)
2023-11-08T16:09:14.939+0000 INFO  [vcf_lcm,0000000000000000,0000] [o.a.coyote.http11.Http11NioProtocol,main] Stopping ProtocolHandler ["http-nio-127.0.0.1-7400"]
2023-11-08T16:09:14.941+0000 INFO  [vcf_lcm,0000000000000000,0000] [o.a.coyote.http11.Http11NioProtocol,main] Destroying ProtocolHandler ["http-nio-127.0.0.1-7400"]
2023-11-08T16:09:14.972+0000 INFO  [vcf_lcm,0000000000000000,0000] [o.s.b.a.l.ConditionEvaluationReportLoggingListener,main]
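
To check a log for this signature without reading it by hand, a minimal Python sketch follows. It assumes the log path shown above and that the error text matches the excerpt; the function name find_bundle_graph_errors is illustrative and is not part of any VMware tooling.

```python
from pathlib import Path

# Error text copied from the excerpt above; other builds may word it differently.
SIGNATURE = "Failed to add edge in bundle graph"


def find_bundle_graph_errors(lines):
    """Return the ERROR lines that contain the bundle graph failure signature."""
    return [
        line.rstrip("\n")
        for line in lines
        if " ERROR " in line and SIGNATURE in line
    ]


if __name__ == "__main__":
    # Path taken from this KB; the scan is skipped if the file is not present.
    log_path = Path("/var/log/vmware/vcf/lcm/lcm.log")
    if log_path.is_file():
        with log_path.open(errors="replace") as handle:
            for hit in find_bundle_graph_errors(handle):
                print(hit)
```

An empty result does not prove the environment is unaffected; it only means this exact message was not found in the file that was scanned.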

Warning: The absence of the error above does not guarantee that a connected customer is unaffected. Affected or not, customers are advised not to apply any AP Tool patches or upgrades until the scripted solution is attached to this KB. Customers who choose to proceed with a patch or upgrade should first verify whether they are impacted by running 'systemctl restart lcm' followed by 'systemctl status lcm'. The second command indicates whether the restart of the lcm service succeeded. If the restart failed, proceed with the remediation steps below.
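
The restart check described in the warning can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported tool: it must be run as root on the SDDC Manager, it treats exit code 0 from 'systemctl status lcm' as "service active", and the settle_seconds delay is an assumed value to give the service time to fail during startup. The function name lcm_restart_succeeded is hypothetical.

```python
import subprocess
import time


def lcm_restart_succeeded(run=subprocess.run, settle_seconds=60):
    """Restart lcm, wait, then report whether `systemctl status lcm` exits 0.

    Assumptions: exit code 0 from `systemctl status` means the unit is
    active, and `settle_seconds` (an arbitrary default) is long enough for
    a crash during startup to show up. `run` is injectable for dry runs.
    """
    restart = run(["systemctl", "restart", "lcm"], capture_output=True, text=True)
    time.sleep(settle_seconds)
    status = run(["systemctl", "status", "lcm"], capture_output=True, text=True)
    return restart.returncode == 0 and status.returncode == 0
```

A False result corresponds to the failure case in the warning above, in which the remediation steps apply.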

Note: In isolated cases the LCM service does not crash even though the SDDC Manager is affected by this issue. After the LCM service is restarted, the SDDC Manager UI may become unresponsive. Running the remediation process is recommended regardless, because this issue affects all SDDC Managers connected to the VMware depot.

Environment

VMware Cloud Foundation 3.x
VMware Cloud Foundation 4.x
VMware Cloud Foundation 5.x

Cause

This issue is caused by an invalid index file that was recently pushed to the VMware depot. Because of the invalid entries in the index file, the LCM service downloaded several incorrect upgrade bundles. While parsing these bundles, the LCM service crashes and cannot recover.

Resolution

0. Take a snapshot of the SDDC Manager VM

1. Download the offline_bundle_cleanup.zip file attached to this KB and upload it to the /home/vcf directory on the SDDC Manager

2. SSH to the SDDC Manager as vcf and su to root

3. Unzip the uploaded zip file:

unzip offline_bundle_cleanup.zip

4. Update file permissions:

chmod 755 delta_file offline_bundle_cleanup.py

5. Execute the cleanup script:

python offline_bundle_cleanup.py delta_file

Note: If any Async Patch or Hot Patch was being applied, the upgrade bundle associated with it may need to be downloaded again to resume the upgrade/patching operations.

Warning: If the script fails with an error reporting that an upgrade is in progress or cancelling, contact VMware Support for assistance, as additional manual cleanup may be required in the SDDC Manager database.

Example of error message:

/opt/vmware/vcf/lcm/lcm-app/bin/bundle_cleanup.py 6dcb75f5-6fe8-463b-953e-9b0d92ee4116
-----------------------------------------------------
LOG FILE : /var/log/vmware/vcf/lcm/bundle_cleanup.log
-----------------------------------------------------
2023-11-09 15:28:39,044 [INFO] root: Performing cleanup for bundle with IDs : ['6dcb75f5-6fe8-463b-953e-9b0d92ee4116']
2023-11-09 15:28:39,067 [INFO] root: Bundle clean up can't be executed while upgrade is in progress or cancelling.
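
A wrapper that runs the cleanup script and flags this specific failure could look like the sketch below. It assumes the script and delta_file were unzipped into /home/vcf as in the steps above, and that the blocking message appears in the script's console output as in the example. The names cleanup_blocked_by_upgrade and run_cleanup are hypothetical.

```python
import subprocess

# Message text copied from the example above; treat the exact wording as an assumption.
BLOCKED_TEXT = "can't be executed while upgrade is in progress or cancelling"


def cleanup_blocked_by_upgrade(output):
    """Return True if the script output contains the upgrade-in-progress message."""
    return BLOCKED_TEXT in output


def run_cleanup(run=subprocess.run, workdir="/home/vcf"):
    """Run the cleanup command from step 5 and return (exit_code, blocked).

    `run` is injectable so the wrapper can be exercised without the script.
    """
    result = run(
        ["python", "offline_bundle_cleanup.py", "delta_file"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    combined = (result.stdout or "") + (result.stderr or "")
    return result.returncode, cleanup_blocked_by_upgrade(combined)
```

If blocked is True, the situation matches the warning above and the next step is to contact VMware Support, not to retry the script.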

Attachments

offline_bundle_cleanup