After upgrading to ESXi 7.0U2, corruption can occur on VMFS datastores if the ESXi hosts sharing those LUNs had their boot devices cloned

Article ID: 318630

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
In /var/log/vmkernel.log, events similar to the following are observed:

2021-06-08T16:35:36.767Z cpu35:2100630)HBX: 6548: 'DatastoreA': HB at offset 3445760 - Marking HB:
2021-06-08T16:35:36.767Z cpu35:2100630)  [HB state abcdef02 offset 3445760 gen 43 stampUS 3883595347 uuid 60bf8d52-4b27a786-c787-0025b5920a03 jrnl <FB 6210136> drv 14.81 lockImpl 4 ip 10.8.68.206]
2021-06-08T16:35:36.767Z cpu35:2100630)HBX: 6552: HB at 3445760 on vol 'DatastoreA' replayHostHB: 0 replayHostHBgen: 0 replayHostUUID:  (00000000-00000000-0000-000000000000).
2021-06-08T16:35:36.767Z cpu6:2100630)HBX: 6667: 'DatastoreA': HB at offset 3445760 - Marked HB:
2021-06-08T16:35:36.767Z cpu6:2100630)  [HB state abcdef04 offset 3445760 gen 43 stampUS 807501607 uuid 60bf8d52-4b27a786-c787-0025b5920a03 jrnl <FB 6210136> drv 14.81 lockImpl 4 ip 10.8.68.206]
2021-06-08T16:35:36.767Z cpu6:2100630)FS3J: 4381: Replaying journal at <type 1 addr 6210136>, gen 43
2021-06-08T16:35:36.775Z cpu2:2100630)HBX: 6548: 'DatastoreA': HB at offset 3445248 - Marking HB:
2021-06-08T16:35:36.775Z cpu2:2100630)  [HB state abcdef02 offset 3445248 gen 25 stampUS 3094347003 uuid 60bf9069-423d1f92-d567-0025b5920a03 jrnl <FB 6211488> drv 14.81 lockImpl 4 ip 10.8.68.205]
2021-06-08T16:35:36.775Z cpu2:2100630)HBX: 6552: HB at 3445248 on vol 'DatastoreA' replayHostHB: 0 replayHostHBgen: 0 replayHostUUID:  (00000000-00000000-0000-000000000000).
2021-06-08T16:35:36.775Z cpu2:2100630)HBX: 6667: 'DatastoreA': HB at offset 3445248 - Marked HB:
2021-06-08T16:35:36.775Z cpu2:2100630)  [HB state abcdef04 offset 3445248 gen 25 stampUS 807509516 uuid 60bf9069-423d1f92-d567-0025b5920a03 jrnl <FB 6211488> drv 14.81 lockImpl 4 ip 10.8.68.205]
2021-06-08T16:35:36.775Z cpu2:2100630)FS3J: 4381: Replaying journal at <type 1 addr 6211488>, gen 25
2021-06-08T16:35:36.784Z cpu2:2100630)HBX: 6548: 'DatastoreA': HB at offset 3444736 - Marking HB:
2021-06-08T16:35:36.784Z cpu2:2100630)  [HB state abcdef02 offset 3444736 gen 145 stampUS 145354135 uuid 60bf9bed-3092e266-2e35-0025b5920a03 jrnl <FB 6210136> drv 14.81 lockImpl 4 ip 10.8.68.207]
2021-06-08T16:35:36.784Z cpu2:2100630)HBX: 6552: HB at 3444736 on vol 'DatastoreA' replayHostHB: 0 replayHostHBgen: 0 replayHostUUID:  (00000000-00000000-0000-000000000000).
2021-06-08T16:35:36.784Z cpu2:2100630)HBX: 6667: 'DatastoreA': HB at offset 3444736 - Marked HB:
2021-06-08T16:35:36.784Z cpu2:2100630)  [HB state abcdef04 offset 3444736 gen 145 stampUS 807518598 uuid 60bf9bed-3092e266-2e35-0025b5920a03 jrnl <FB 6210136> drv 14.81 lockImpl 4 ip 10.8.68.207]
2021-06-08T16:35:36.784Z cpu2:2100630)FS3J: 4381: Replaying journal at <type 1 addr 6210136>, gen 145
2021-06-08T16:35:36.796Z cpu18:2100630)HBX: 4720: 3 stale HB slot(s) owned by me have been garbage collected on vol 'DatastoreA'
2021-06-08T16:35:36.802Z cpu18:2100630)WARNING: FS3: 608: VMFS volume DatastoreA/59de4594-ceca2bd1-1832-0025b5920a02 on naa.xxxxxxxxxxxxxxx:1 has been detected corrupted
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 610: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 638: and upload the dump by `dd if=/vmfs/devices/disks/naa.xxxxxxxxxxxxxxx:1 of=X bs=1M count=1200 conv=notrunc`
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 641: where X is the dump file name on a DIFFERENT volume
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 319: FS3RCMeta 3881 200 1 67 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 326: 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 332: 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 338: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-06-08T16:35:36.802Z cpu18:2100630)WARNING: FS3J: 2240: Error freeing journal block <FBA tbz 0 cow 0 blk 776267> (returned 0) for 59de4594-ceca2bd1-1832-0025b5920a02: Invalid metadata
2021-06-08T16:35:36.802Z cpu18:2100630)WARNING: HBX: 3820: Cannot free journal <type 1 addr 6210136> on vol 'DatastoreA'
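
To check a host quickly for this signature, the log can be filtered for the "has been detected corrupted" warning. The following is a minimal sketch, not an official tool: the function name find_corrupted_volumes is hypothetical, and it assumes the message format shown in the excerpt above.

```shell
# Hypothetical helper: list each volume and device that vmkernel.log reports as corrupted.
# Assumes the "VMFS volume <name> on <device> has been detected corrupted" format.
find_corrupted_volumes() {
    logfile="${1:-/var/log/vmkernel.log}"
    grep 'has been detected corrupted' "$logfile" \
        | sed 's/.*VMFS volume \(.*\) on \([^ ]*\) has been detected corrupted.*/\1 \2/' \
        | sort -u
}

# Example on an ESXi host (reads /var/log/vmkernel.log by default):
#   find_corrupted_volumes
```

Each output line holds the volume name and UUID followed by the backing device, with repeated warnings collapsed into one entry.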


Environment

VMware vSphere ESXi 7.0.0
VMware vSphere ESXi 7.0.x

Cause

When an ESXi boot device is cloned, the System Universally Unique Identifier (UUID) is cloned with it. VMFS uses this identifier for heartbeat and journal operations, so if multiple hosts share the same UUID, a split-brain situation can result in which the ESXi hosts attempt to access each other's metadata regions on VMFS. The most common source of cloned ESXi boot devices is boot LUNs cloned for rapid deployment.
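
One way to find out whether hosts share a System UUID is to collect the value from each host (for example with `esxcli system uuid get`) and look for repeats. The sketch below assumes a hypothetical inventory format of one "hostname uuid" pair per line; the function name and the host names in the example are illustrative only.

```shell
# Input: one "hostname uuid" pair per line (hypothetical inventory format),
# read from the named files or from standard input.
# Output: every UUID reported by more than one host, followed by those hosts.
find_duplicate_uuids() {
    awk '{ count[$2]++; hosts[$2] = hosts[$2] " " $1 }
         END { for (u in count) if (count[u] > 1) print u ":" hosts[u] }' "$@" | sort
}

# Example with placeholder values:
#   printf '%s\n' 'esx01 aaaa-1' 'esx02 aaaa-1' 'esx03 bbbb-2' | find_duplicate_uuids
```

Empty output means no two hosts in the inventory reported the same UUID.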

Resolution

Cloning ESXi boot devices is not supported. While this may have worked in previous versions of ESXi, ESXi 7.0 U2 and later have additional dependencies on the System UUID being unique. See the official statement about cloned ESXi boot device supportability here: https://kb.vmware.com/s/article/84280

Workaround:
If you are running ESXi hosts that have cloned boot devices in your environment, there is a four-step process to change the System UUID on each server so that it is unique. This process only works on hosts that have not yet been upgraded to ESXi 7.0 U2. If hosts have already been upgraded to 7.0 U2, the only supported solution is to rebuild those hosts.
 Note: This will not work on the original host with the correct MAC address in the UUID.

1. Enable the advanced ESXi setting FollowHardwareMac, which automatically updates the VMkernel's MAC address whenever the network adapter's MAC address changes. To enable it, run the following ESXCLI command:
 
esxcli system settings advanced set -o /Net/FollowHardwareMac -i 1
 
2. Delete the existing System UUID entry from the /etc/vmware/esx.conf configuration file. This ensures that a new System UUID is generated automatically when the system boots and is written to this file. To do so, open esx.conf, delete the entire /system/uuid line, and save the file. Alternatively, the following one-liner clears this line in the config file without opening vi:
 Note: Run this exactly as shown. The # characters are part of the command and are not placeholders for numbers.

sed -i 's#/system/uuid.*##' /etc/vmware/esx.conf
 
3. To ensure that this change is persisted, run the following command:

/sbin/auto-backup.sh

4. Reboot the ESXi host to generate the new System UUID. It is important to verify that the System UUID has actually changed from the original.
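
The verification in step 4 can be sketched as a comparison of the UUID recorded before the change with the value reported after the reboot. This is only an illustration: the function name uuid_changed is hypothetical, and it assumes the earlier value was noted somewhere that survives the reboot.

```shell
# Hypothetical helper: succeed only if the UUID after the reboot is non-empty
# and differs (ignoring letter case) from the UUID recorded before the change.
uuid_changed() {
    before=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    after=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$after" ] && [ "$before" != "$after" ]
}

# Example on an ESXi host, where $recorded_before holds the value noted earlier:
#   uuid_changed "$recorded_before" "$(esxcli system uuid get)" \
#       && echo "System UUID changed" \
#       || echo "System UUID is unchanged or missing"
```

The current value can also be read from the /system/uuid entry in /etc/vmware/esx.conf once the host is back up.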

Note: All datastores affected by corruption will need to be reformatted to clear the corruption. This should be done AFTER changing the UUID on ALL ESXi hosts; otherwise, corruption will continue.
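
To see what the sed one-liner from step 2 does before running it against the live /etc/vmware/esx.conf, it can be rehearsed on a scratch file. This is a minimal sketch: the scratch file and its non-UUID entries are made up for illustration, and the UUID value is a placeholder.

```shell
# Build a scratch file that mimics a few esx.conf entries (contents are made up).
scratch=$(mktemp)
printf '%s\n' \
    '/system/hostname = "esx01"' \
    '/system/uuid = "00000000-00000000-0000-000000000000"' \
    '/net/routes/kernel/gateway = "192.0.2.1"' > "$scratch"

# Same expression as in step 2; '#' is the sed delimiter, so the pattern
# "/system/uuid.*" is replaced with an empty string.
sed -i 's#/system/uuid.*##' "$scratch"

# Show the result: the uuid entry is gone and the other entries are untouched.
cat "$scratch"
```

The entry is replaced by an empty line rather than removed outright, so the scratch file still has three lines afterwards.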