Corrupted vFAT partitions from ESXi 6.5/6.7 environment might cause upgrades to ESXi 7.0.x or ESXi 8.0 to fail


Article ID: 345227


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article describes how to recover from one or more corrupted vFAT partitions that prevent re-partitioning during an upgrade to ESXi 7.0.x or ESXi 8.0.


Symptoms:
Due to corrupted vFAT partitions, upgrades from ESXi 6.5 and 6.7 to versions up to ESXi 7.0 Update 3k or ESXi 8.0c may show the following symptoms:
  • Log messages such as "ramdisk (root) is full" appear in the vmkwarning.log file.
  • The host unexpectedly rolls back to ESXi 6.5 or 6.7 after it was upgraded to ESXi 7.0.x or ESXi 8.0.
  • After a host upgrade, the bootbanks are not linked (/bootbank and /altbootbank) and OSDATA is not present.
  • The backtrace in jumpstart-native-stdout.log shows errors such as the following:
2022-12-08 11:07:23,082 SystemStorage t10.ATA_____MTFDDAV240TCB___________________________________18331E1948DD: upgrading partition layout...
Traceback (most recent call last):
  File "/bin/initSystemStorage", line 1354, in <module>
    storage.setupSystemPartitions()
  File "/bin/initSystemStorage", line 659, in setupSystemPartitions
    self.upgradePartitionTable(bootDisk)
  File "/bin/initSystemStorage", line 413, in upgradePartitionTable
    upgradeBackup()
  File "/lib64/python3.8/site-packages/systemStorage/upgradeUtils.py", line 307, in upgradeBackup
  File "/lib64/python3.8/site-packages/systemStorage/upgradeUtils.py", line 201, in calculateDirMiBSize
  File "/lib64/python3.8/genericpath.py", line 50, in getsize
FileNotFoundError: [Errno 2] No such file or directory: '/vmfs/volumes/5d031f44-ee1c17aa-16e8-506b4b1c574e/log/\x03\x05\x03\x01yd\x1fy.\udce8g\udcdd'
2022-12-08T11:07:34.523Z Plugin system-storage failed Invoking method start (rc=1)
  • When upgrading to ESXi 7.0 Update 3l or ESXi 8.0 Update 1, the operation fails with a purple diagnostic screen and an error such as:
An error occurred while backing up VFAT partition files before re-partitioning: Failed to calculate size for temporary Ramdisk: <error>.
An error occurred while backing up VFAT partition files before re-partitioning: Failed to copy files to Ramdisk: <error>.

NOTE: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
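
The traceback above fails on a file name made of control characters and invalid bytes. As a rough aid for spotting such entries before an upgrade, the following is a minimal sketch and not an official diagnostic. The function name flag_suspect_names is hypothetical, and the sketch assumes that legitimate file names on the vFAT partitions contain only printable ASCII characters, so any other name is worth a closer look.

```shell
#!/bin/sh
# Sketch only: flag_suspect_names is a hypothetical helper, not an ESXi tool.
# Reads one file name per line on stdin and prints every name that contains
# a character outside the printable ASCII range.
flag_suspect_names() {
    (
        # Run in a subshell with the C locale so that every byte above
        # 0x7e and every control character counts as non-printable.
        LC_ALL=C
        export LC_ALL
        while IFS= read -r name; do
            case $name in
                *[![:print:]]*) printf '%s\n' "$name" ;;
            esac
        done
    )
}
```

For example, the output of ls for the log directory of a vFAT volume could be piped into flag_suspect_names. A flagged name is only a hint that the partition deserves a file system check; a clean listing does not prove the partition is healthy.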

Environment

VMware vSphere ESXi 8.0.x
VMware vSphere ESXi 7.0.x

Cause

The cause is under investigation.

Resolution

This is a known issue and currently there is no resolution.

Workaround:

Identify all vFAT partitions

  1. Each ESXi host has 4 vFAT partitions on ESXi 6.5 and ESXi 6.7: 2 bootbanks, scratch, and locker

    # esxcli storage filesystem list
    Mount Point                                        Volume Name  UUID                                 Mounted  Type            Size          Free
    -------------------------------------------------  -----------  -----------------------------------  -------  ------  ------------  ------------
    /vmfs/volumes/63fe1b2d-6e3a9dc9-e809-000c298546fa  datastore1   63fe1b2d-6e3a9dc9-e809-000c298546fa     true  VMFS-6  129385889792  127599116288
    /vmfs/volumes/63fe1b26-fdea7ff3-f520-000c298546fa               63fe1b26-fdea7ff3-f520-000c298546fa     true  vfat       299712512     108437504
    /vmfs/volumes/079b6e7e-bbef0920-3675-2a9602fe2ce5               079b6e7e-bbef0920-3675-2a9602fe2ce5     true  vfat       261853184      88797184
    /vmfs/volumes/63fe3b74-53874442-77fa-000c298546fa               63fe3b74-53874442-77fa-000c298546fa     true  vfat      4293591040    4079943680
    /vmfs/volumes/7ad83874-adf59f0f-e7d7-7e4e5b6505b5               7ad83874-adf59f0f-e7d7-7e4e5b6505b5     true  vfat       261853184     261849088
  2. From the mount points, it is possible to identify the disk and partition

    # vmkfstools -P /vmfs/volumes/63fe1b26-fdea7ff3-f520-000c298546fa
    vfat-0.04 (Raw Major Version: 0) file system spanning 1 partitions.
    File system label (if any):
    Mode: private
    Capacity 299712512 (36586 file blocks * 8192), 108437504 (13237 blocks) avail, max supported file size 0
    Disk Block Size: 512/0/0
    UUID: 63fe1b26-fdea7ff3-f520-000c298546fa
    Partitions spanned (on "disks"):
        mpx.vmhba0:C0:T0:L0:8
    Is Native Snapshot Capable: NO

    The disk and partition ID is mpx.vmhba0:C0:T0:L0:8.
    Repeat this step for all vFAT partitions. At the end, you will have a list like this:

    1. mpx.vmhba0:C0:T0:L0:2 (scratch)
    2. mpx.vmhba0:C0:T0:L0:5 (bootbank 1)
    3. mpx.vmhba0:C0:T0:L0:6 (bootbank 2)
    4. mpx.vmhba0:C0:T0:L0:8 (locker)
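
The two lookups above can be combined so that the list does not have to be assembled by hand. The following is a minimal sketch under stated assumptions: the helper names list_vfat_mounts and spanned_partition are hypothetical, awk is assumed to be available in the ESXi shell, and the column layout is assumed to match the sample output shown in this article.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not part of ESXi.

# Print the mount point of every row whose Type column is "vfat".
# Reads `esxcli storage filesystem list` output on stdin. Type is the third
# field from the end, which holds even when the Volume Name column is empty.
list_vfat_mounts() {
    awk 'NF >= 3 && $(NF-2) == "vfat" { print $1 }'
}

# Print the first device listed after the "Partitions spanned" line.
# Reads `vmkfstools -P <mount point>` output on stdin.
spanned_partition() {
    awk '/Partitions spanned/ { grab = 1; next }
         grab && NF           { print $1; grab = 0 }'
}
```

On a host, the output of esxcli storage filesystem list would be piped into list_vfat_mounts, and the output of vmkfstools -P for each resulting mount point into spanned_partition. The sketch does not label which partition is scratch, bootbank, or locker; that mapping still has to be checked manually.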

Enter maintenance mode and stop all daemons

To avoid interference between the following steps and any daemon writing to the disk, check for open file handles and close them.

  1. Stop crond, which periodically schedules backup.sh, updating the active bootbank
    kill $(cat /var/run/crond.pid)
  2. Stop vmsyslogd, which has open file handles on /scratch (log files)
    /usr/lib/vmware/vmsyslog/bin/shutdown.sh
  3. Check for further daemons having open file handles on the scratch partition and stop these daemons

    # lsof |grep scratch
    1001391762  vmfstracegd           FILE                        4   /scratch/vmfstraces/vmfsGlobalTrace.trace.0.gz
     
    # /etc/init.d/vmfstraced stop
    watchdog-vmfstracegd: Terminating watchdog process with PID 1001391748
    vmfstracegd stopped
    # lsof |grep scratch
     
    -- note: 63fe3b74-53874442-77fa-000c298546fa is the UUID of the scratch partition
    # lsof |grep 63fe3b74-53874442-77fa-000c298546fa
    1001391489  rhttpproxy            FILE                       18   /vmfs/volumes/63fe3b74-53874442-77fa-000c298546fa/log/rhttpproxy-1001391489-000000db02450060-lo0-1.pcap
    1001391489  rhttpproxy            FILE                       19   /vmfs/volumes/63fe3b74-53874442-77fa-000c298546fa/log/rhttpproxy-1001391489-000000db024501a8-vmk0-1.pcap
    # /etc/init.d/rhttpproxy stop
     
    # lsof | grep var/run/log
    2101088    python               FILE                       5  /var/run/log/vsandevicemonitord.log
    
    # /etc/init.d/vsandevicemonitord stop
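
The repeated lsof checks above can be condensed into a short summary of which processes still hold files open. This is a minimal sketch: the helper name open_handle_owners is hypothetical, awk and sort are assumed to be available in the ESXi shell, and the process name is assumed to be the second column of the lsof output, as in the samples above.

```shell
#!/bin/sh
# Sketch only: open_handle_owners is a hypothetical helper, not an ESXi tool.
# Usage: lsof | open_handle_owners PATTERN
# Prints each distinct process name (second column) that has an open file
# whose lsof line contains PATTERN, for example "/scratch" or a volume UUID.
open_handle_owners() {
    awk -v pat="$1" 'index($0, pat) { print $2 }' | sort -u
}
```

Run it once per pattern used above (scratch, the scratch volume UUID, var/run/log). The printed name is not always the name of the init script: in the example above the process vmfstracegd is stopped with /etc/init.d/vmfstraced.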

To resolve this issue, follow one of the solutions below.

Solution 1 (Preferred solution) - Use dosfsck as a first solution

For all identified vFAT partitions, check the file system integrity and repair the partition as needed

  1. Check the health of the vFAT partition
    1. dosfsck -Vv /dev/disks/<disk and partition id>
      The disk and partition ID was derived when identifying the vFAT partitions above
    2. For instance, the output for a healthy partition

      # dosfsck -Vv /dev/disks/mpx.vmhba0\:C0\:T0\:L0:2
      dosfsck 2.11 (12 Mar 2005)
      dosfsck 2.11, 12 Mar 2005, FAT32, LFN
      Checking we can access the last sector of the filesystem
      Boot sector contents:
      System ID "MSDOS5.0"
      Media byte 0xf8 (hard disk)
             512 bytes per logical sector
           65536 bytes per cluster
               2 reserved sectors
      First FAT starts at byte 1024 (sector 2)
               2 FATs, 16 bit entries
          131072 bytes per FAT (= 256 sectors)
      Root directory starts at byte 263168 (sector 514)
             512 root directory entries
      Data area starts at byte 279552 (sector 546)
           65515 data clusters (4293591040 bytes)
      32 sectors/track, 64 heads
               0 hidden sectors
         8386560 sectors total
      Starting check/repair pass.
      Checking for unused clusters.
      Starting verification pass.
      Checking for unused clusters.
      /dev/disks/mpx.vmhba0:C0:T0:L0:2: 222 files, 3279/65515 clusters
  2. If the command reports any failures or hangs, then try to repair the partition
    dosfsck -a -w /dev/disks/<disk and partition id>
  3. Repeat step 1. If dosfsck still reports failures, proceed with the next solution to re-create the partition
  4. After you have checked all ESXi partitions, reboot the ESXi host (this will restart all previously stopped daemons)
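
The check in step 1 has to be repeated for every vFAT partition, which can be sketched as a loop. This is a minimal sketch under assumptions, not a supported tool: the function name check_partitions and the FSCK_CMD override are hypothetical, and the sketch assumes that dosfsck exits with a non-zero status when it finds a problem. Standard input is redirected from /dev/null so that a prompt cannot block the loop.

```shell
#!/bin/sh
# Sketch only: check_partitions and FSCK_CMD are hypothetical names.
# Usage: check_partitions <disk and partition id>...
# Runs the step 1 check on each partition and prints OK or FAILED per ID.
# Returns 1 if at least one check did not pass, otherwise 0.
check_partitions() {
    failed=0
    for part in "$@"; do
        # FSCK_CMD allows substituting another checker; default is dosfsck.
        if "${FSCK_CMD:-dosfsck}" -Vv "/dev/disks/$part" </dev/null >/dev/null 2>&1; then
            echo "OK      $part"
        else
            echo "FAILED  $part"
            failed=1
        fi
    done
    return "$failed"
}
```

For example: check_partitions mpx.vmhba0:C0:T0:L0:2 mpx.vmhba0:C0:T0:L0:5 mpx.vmhba0:C0:T0:L0:6 mpx.vmhba0:C0:T0:L0:8. The sketch discards the dosfsck output and does not detect a hang, so any partition reported as FAILED should be re-checked manually with the command from step 1 before a repair is attempted.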

Solution 2 - Re-create a corrupted vFAT partition

  1. Back up all files. In this example, we back up /scratch and keep a copy on datastore1
    cp -r /scratch/ /vmfs/volumes/datastore1/scratchBackup
    (At this point it is very likely that the cp command returns a failure, because the file system is corrupted and one or more files or file names will be invalid. In that case, copy folder by folder or file by file and leave the corrupted files on the disk. After re-formatting, those files will be lost!)

  2. (Re-)Format the corrupted partition

    # vmkfstools -C vfat /dev/disks/mpx.vmhba0:C0:T0:L0:2
    create fs deviceName:'/dev/disks/mpx.vmhba0:C0:T0:L0:2', fsShortName:'vfat', fsName:'(null)'
    deviceFullPath:/dev/disks/mpx.vmhba0:C0:T0:L0:2 deviceFile:mpx.vmhba0:C0:T0:L0:2
    Checking if remote hosts are using this device as a valid file system. This may take a few seconds...
    Creating vfat file system on "mpx.vmhba0:C0:T0:L0:2" with blockSize 1048576 and volume label "none".
    Successfully created new volume: 640748a7-86efb63c-0e53-000c298546fa

    (Note: If the command returns a busy error, this indicates that a file on this disk is still open. See above steps to identify the open handles.)

  3. Restore the content 

    1. Get the volume ID from the previous command (e.g., 640748a7-86efb63c-0e53-000c298546fa)
    2.  # cp -r /vmfs/volumes/datastore1/scratchBackup/* /vmfs/volumes/640748a7-86efb63c-0e53-000c298546fa/
  4. Reboot the ESXi host after you have checked and repaired all vFAT partitions.
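
The file-by-file backup described in step 1 of Solution 2 can be sketched as follows. This is a minimal sketch under assumptions: the function name backup_tree is hypothetical, and find, dirname, and sh are assumed to be available in the ESXi shell. Each file is copied individually, so one unreadable entry does not abort the whole backup.

```shell
#!/bin/sh
# Sketch only: backup_tree is a hypothetical helper, not an ESXi tool.
# Usage: backup_tree SRC DST
# Copies every regular file under SRC to the same relative path under DST.
# A file that cannot be copied is named on stderr and skipped.
backup_tree() {
    src=${1%/}    # drop one trailing slash so the prefix strip below matches
    dst=${2%/}
    mkdir -p "$dst" || return 1
    # Inside the inline script: $1 is the source root, $2 the destination
    # root, $3 the file that find is currently visiting.
    find "$src" -type f -exec sh -c '
        rel=${3#"$1"/}
        dir=$(dirname "$2/$rel")
        mkdir -p "$dir" && cp "$3" "$dir/" || echo "SKIPPED: $3" >&2
    ' sh "$src" "$dst" {} \;
}
```

For example: backup_tree /scratch /vmfs/volumes/datastore1/scratchBackup. Empty directories are not re-created by this sketch, and files reported as SKIPPED are lost once the partition is re-formatted.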


Additional Information

Impact/Risks:

None