Virtual machine disk hung during the vSphere replication and the issue happens with the VMware tools 11.x and above
search cancel

Virtual machine disk hung during the vSphere replication and the issue happens with the VMware tools 11.x and above

book

Article ID: 318548

calendar_today

Updated On:

Products

VMware VMware vSphere ESXi

Issue/Introduction

This issue observed during replication it might be vSphere replication or from 3rd party replication for example Veeam.

Symptoms:
After upgrading the tools to 11.x version the VM disks going in-accessible during SRM replication.

Every time when the replication for the Virtual machine is initiated the data disk gets hung and won’t be accessible until replication is stopped.

If the virtual machine running on 10.2.x/10.3.x tools version the issue is not observed


In the windows event logs PVSCSI errors can be seen “Event ID 129 Reset to device, \Device\RaidPort0, was issued.”


vmware.log
==========
2020-05-25T16:12:09.806Z| vcpu-0| I125: HBACommon: First write on scsi1:0.fileName='/vmfs/volumes/5a820ccb-4877c310-8c85-0025b510008f/PRDFILE02/PRDFILE02_1.vmdk'
2020-05-25T16:12:09.806Z| vcpu-0| I125: DDB: "longContentID" = "705076b4427e484b7ced8a6850124967" (was "81c0dd8ae32a1bde6e4e467a42a5825c")
2020-05-25T16:12:09.834Z| vcpu-0| I125: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0x42a5825c, new=0x50124967 (705076b4427e484b7ced8a6850124967)
2020-05-25T16:12:09.837Z| vcpu-0| I125: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/5a820ccb-4877c310-8c85-0025b510008f/PRDFILE02/PRDFILE02.vmdk'
2020-05-25T16:12:09.837Z| vcpu-0| I125: DDB: "longContentID" = "b8a1c4166b9f01522c15e842554f92be" (was "46aeb14f47edd43b3745749f3b36b977")
2020-05-25T16:12:09.866Z| vcpu-0| I125: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0x3b36b977, new=0x554f92be (b8a1c4166b9f01522c15e842554f92be)
2020-05-25T16:12:10.090Z| vcpu-1| I125: HBACommon: First write on scsi0:2.fileName='/vmfs/volumes/5a820ccb-4877c310-8c85-0025b510008f/PRDFILE02/PRDFILE02_2.vmdk'
2020-05-25T16:12:10.090Z| vcpu-1| I125: DDB: "longContentID" = "c9b68f14469645722baec2997fabb24a" (was "d8e63ac44722b21faf97234fe760d97a")
2020-05-25T16:12:10.118Z| vcpu-1| I125: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0xe760d97a, new=0x7fabb24a (c9b68f14469645722baec2997fabb24a)
2020-05-25T16:12:55.393Z| vmx| I125: GuestRpcSendTimedOut: message to toolbox timed out.
2020-05-25T16:12:55.393Z| vmx| I125: GuestRpcSendTimedOut: message to toolbox timed out.
2020-05-25T16:13:15.397Z| vmx| I125: GuestRpcSendTimedOut: message to toolbox timed out.
2020-05-25T16:13:15.397Z| vmx| I125: GuestRpc: app toolbox's second ping timeout; assuming app is down

vmkernel.log (on the host where the VM is running)
============
2020-05-25T16:12:25.687Z cpu1:5350368)Hbr: 1198: File hbrtmp.1.6660 (groupID=GID-e674df0b-8408-447f-9645-205f8e83c1d6) (offset=0) already exists on server and is identical
2020-05-25T16:12:25.688Z cpu1:5350368)Hbr: 1198: File hbrtmp.2.41 (groupID=GID-e674df0b-8408-447f-9645-205f8e83c1d6) (offset=0) already exists on server and is identical
2020-05-25T16:12:25.690Z cpu7:2099708)Hbr: 2250: Prepared delta (diskID=RDID-0854baae-7c7f-4d57-8294-dae420f227b5) (numExtentsToTransfer=323760)
2020-05-25T16:12:25.690Z cpu7:2099708)Hbr: 2250: Prepared delta (diskID=RDID-b7c39039-4d2d-48fd-a33e-dcf0ce9be6b1) (numExtentsToTransfer=7852153)
2020-05-25T16:12:25.690Z cpu7:2099708)Hbr: 2250: Prepared delta (diskID=RDID-077c07b3-18ff-4605-8574-1be8fea3eb52) (numExtentsToTransfer=22991067)
2020-05-25T16:13:18.307Z cpu24:5350431)J6: 2651: 'Prdsan05_56_PRDDB10_Log_SCUReport': Exiting async journal replay manager world

Enable debug log-in for VMware tools 

[2020-05-25T16:20:26.744Z] [   debug] [vmsvc] VMTools_ConfigGetBoolean:Returning default value for '[guestinfo] diskinfo-report-uuid'=TRUE (Not founderr=4).
[2020-05-25T16:20:26.744Z] [   debug] [guestinfo] GetVolumeUUID:'\\?\Volume{b78c6c13-0000-0000-0000-100000000000}' is SCSI
[2020-05-25T16:20:26.744Z] [   debug] [guestinfo] GetVolumeUUID:SerialNumberOffset(0) isn't valid
[2020-05-25T16:20:27.322Z] [    info] [vmsvc] tools service recovered from a hang.
[2020-05-25T16:20:32.369Z] [    info] [vmsvc] tools hang detector time sequence 1.00s, 1.02s, 1.01s, 1.02s, 1.00s.
[2020-05-25T16:20:32.369Z] [    info] [vmsvc] tools service hung.
[2020-05-25T16:21:25.309Z] [   debug] [guestinfo] GetVolumeUUID:'\\?\Volume{0b2b2d1a-47c0-11ea-a99b-0050568997d6}' is SCSI
[2020-05-25T16:21:25.309Z] [   debug] [guestinfo] GetVolumeUUID:SerialNumberOffset(0) isn't valid
[2020-05-25T16:21:25.309Z] [   debug] [guestinfo] GetVolumeUUID:'\\?\Volume{455cd15a-0000-0000-0000-501f00000000}' is SCSI
[2020-05-25T16:21:25.309Z] [   debug] [guestinfo] GetVolumeUUID:SerialNumberOffset(0) isn't valid


Environment

VMware vSphere ESXi 6.7
VMware Tools 11.x

Cause

During the replication the drive is getting stuck trying to get the disk UUID's and hits a 60 second delay timeout.

Resolution

The permanent fix for this will be in 11.1.5 version download here

Workaround:
We can try the workaround while using 11.x Tools.

In tools.conf inside the affected guest, put

[guestinfo]
diskinfo-report-uuid=false

vmtoolsd will pick the change up within 5 seconds; no need to restart.