vCenter 6.7 backup fails with "ERROR: Timeout! Failed to complete in 72000 seconds" after being stuck at 95%

Article ID: 318486

Products: VMware vCenter Server

Issue/Introduction

Symptoms:
The vCenter Server database is larger than 300 GB. You can verify this from the VCSA command line or in the VAMI:

Run df -h on the VCSA command line.
Log in to the VAMI and select Monitor > Disks.

Backup progress in the VAMI is stuck at 95%.

The BackupManager.py processes are in a sleeping state. To confirm, follow these steps:

Collect the PIDs of the BackupManager.py processes:

root@vc-prod [~]# ps -eaf | grep "backup"
root 13443 1844 2 Jul23 ? 00:22:45 /usr/bin/python /usr/lib/applmgmt/backup_restore/py/vmware/appliance/backup_restore/BackupManager.py
root 13473 13443 0 Jul23 ? 00:00:00 /usr/bin/python /usr/lib/applmgmt/backup_restore/py/vmware/appliance/backup_restore/BackupManager.py

Confirm that the processes with the PIDs found in the first step are in the sleeping (S) state:

root@vc-prod [~]# cat /proc/13473/status
Name: python
State: S (sleeping)
Tgid: 13473
Ngid: 0
Pid: 13473
PPid: 13443
<snip>
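The checks above can also be scripted. The following is a minimal Python sketch. It assumes a Linux /proc filesystem, as on the VCSA. It also assumes that the database data lives on the partitions mounted at /storage/db and /storage/seat. The article only says to use df -h or the VAMI, so confirm the mount points on your appliance. The helper names used_gib, find_pids and proc_state are hypothetical and are not part of vCenter.

```python
import os
import shutil

GIB = 1024 ** 3  # df -h reports sizes in powers of 1024

# Assumption: the vCenter database data lives on these mount points.
# Confirm the real mount points with df -h before relying on this.
DB_PATHS = ("/storage/db", "/storage/seat")


def used_gib(path):
    """Used space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).used / GIB


def find_pids(pattern):
    """PIDs whose command line contains `pattern`, like ps -eaf | grep."""
    me = os.getpid()
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit() or int(entry) == me:
            continue  # skip non-process entries and this script itself
        try:
            with open("/proc/{}/cmdline".format(entry), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited or is not readable
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace")
        if pattern in cmdline:
            pids.append(int(entry))
    return sorted(pids)


def proc_state(pid):
    """One-letter state from /proc/<pid>/status, e.g. 'S' for sleeping."""
    with open("/proc/{}/status".format(pid)) as f:
        for line in f:
            if line.startswith("State:"):
                return line.split()[1]
    raise ValueError("no State line for PID {}".format(pid))


if __name__ == "__main__":
    for path in DB_PATHS:
        if os.path.isdir(path):
            print("{}: {:.1f} GiB used".format(path, used_gib(path)))
    for pid in find_pids("BackupManager.py"):
        try:
            state = proc_state(pid)
        except OSError:
            continue  # the process exited between the two reads
        print("BackupManager.py PID {}: state {}".format(pid, state))
```

Run it as root on the appliance. A sleeping state on its own is normal for an idle process. It is the combination with a backup stuck at 95% that matches this issue.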

Environment

VMware vCenter Server Appliance 6.7.x

Cause

Under certain conditions, a deadlock can occur in the backup and restore process.
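The article does not explain the deadlock. The workaround below swaps the order of procRecord.process.join() and procRecord.process.statusQ.get(False). If statusQ is a multiprocessing.Queue, this matches a pitfall in Python's multiprocessing programming guidelines ("Joining processes that use queues"). The article does not say what type statusQ is, so that part is an assumption. The pitfall works like this: a process that has put items on a queue does not terminate until all buffered items have been written to the underlying pipe. Joining it before the queue is drained can therefore hang. The sketch reproduces that mechanism with a hypothetical worker. It is an illustration, not the code in Proc.py.

```python
import multiprocessing as mp


def _worker(status_q, size):
    # Hypothetical child: reports a large status object through the queue.
    # Its queue feeder thread cannot finish until the parent reads the data,
    # because the payload is larger than the pipe buffer.
    status_q.put("x" * size)


def run_child(join_first, size=2 * 1024 * 1024, join_timeout=1.0):
    """Return (child_alive_after_first_join, payload_length).

    join_first=True mimics the original order (join, then read the queue);
    join_first=False mimics the workaround order (read the queue, then join).
    """
    ctx = mp.get_context("fork")  # Linux start method, as on the appliance
    status_q = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(status_q, size))
    proc.start()
    if join_first:
        proc.join(timeout=join_timeout)  # times out: the child is blocked
        alive = proc.is_alive()
        status = status_q.get(timeout=10)  # draining the queue unblocks it
    else:
        status = status_q.get(timeout=10)
        proc.join(timeout=join_timeout)  # the child can now exit normally
        alive = proc.is_alive()
    proc.join()  # reap the child in both cases
    return alive, len(status)
```

In this sketch, calling join() with a timeout before reading the queue returns while the child is still alive. Reading the queue first lets the child exit normally.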

Resolution

This issue is resolved in vCenter Server 6.7 U3j and in vCenter Server 7.0, both available at VMware Downloads.

Workaround:

Back up the file /usr/lib/applmgmt/backup_restore/py/vmware/appliance/backup_restore/util/Proc.py before editing it.
Modify the file by moving lines 140 and 141 so that they come directly after line 144.

Lines 140 and 141, to be moved:

140 procRecord.process.join(timeout=timeout)
141 procRecord.joined = True

Line 144, after which they should be placed:

144 procRecord.status = procRecord.process.statusQ.get(False)

Restart the applmgmt service (service-control --restart applmgmt).
Start the backup again.
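The backup and edit steps of the workaround can also be scripted. The following is a minimal sketch. It assumes that the installed Proc.py uses the same line numbers as this article. The function move_join_after_get, the EXPECTED table and the .bak suffix are hypothetical names chosen for illustration. Before touching the file, the script checks lines 140, 141 and 144 against the text shown in the workaround and refuses to edit if they differ. It moves the two lines with their existing indentation, so review the result before restarting applmgmt.

```python
import shutil

# Path from the article; pass another path to try this on a copy first.
PROC_PY = ("/usr/lib/applmgmt/backup_restore/py/vmware/appliance/"
           "backup_restore/util/Proc.py")

# The 1-based lines the article names, compared without surrounding spaces.
EXPECTED = {
    140: "procRecord.process.join(timeout=timeout)",
    141: "procRecord.joined = True",
    144: "procRecord.status = procRecord.process.statusQ.get(False)",
}


def move_join_after_get(path=PROC_PY, backup_suffix=".bak"):
    """Back up `path`, then move lines 140-141 to directly after line 144.

    Raises ValueError without touching the file if the lines do not match
    EXPECTED, for example on a different build or a file already edited.
    Returns the path of the backup copy.
    """
    with open(path) as f:
        lines = f.readlines()
    for lineno, text in sorted(EXPECTED.items()):
        if len(lines) < lineno or lines[lineno - 1].strip() != text:
            raise ValueError("line {} does not match; not editing".format(lineno))

    backup = path + backup_suffix
    shutil.copy2(path, backup)  # keep the original, with its metadata

    if not lines[143].endswith("\n"):
        lines[143] += "\n"  # line 144 was the last line; keep lines separate
    # Slices are 0-based: lines[139:141] holds lines 140-141 and
    # lines[141:144] holds lines 142-144. Moved lines keep their indentation.
    moved = lines[139:141]
    lines = lines[:139] + lines[141:144] + moved + lines[144:]
    with open(path, "w") as f:
        f.writelines(lines)
    return backup
```

Run it as root on the appliance, then restart applmgmt. To undo the change, copy the .bak file back over Proc.py.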