vSAN Deduplication enabled -- Identifying Failed Disk

Article ID: 327008

Products

VMware vSAN

Issue/Introduction

Symptoms:
When VMware vSAN is used with Deduplication enabled, the failure of any single disk causes the entire disk group it belongs to to fail.
The related vSAN Health check "Operation health" reports the entire disk group as offline.
[Screenshot: health2.jpg]
The same state is visible under Configure > vSAN > Disk Management:

[Screenshot: host.jpg]

In the screenshots, one disk is marked as "Absent".
Because the disk has failed, only its disk UUID is shown; the original device name (for example, naa.xxxxx) is no longer displayed.

See here for additional information:
Using Deduplication and Compression




Environment

VMware vSAN 6.2.x
VMware vSAN 6.5.x
VMware vSAN 6.6.x
VMware vSAN 6.7.x
VMware vSAN 7.0.x
VMware vSAN 8.0.x

Cause

vSAN deduplication is enabled cluster-wide and performed at the disk group level. As a result, the failure of a single disk in a disk group results in the failure of the entire disk group. The UI reflects this disk group failure but does not show identifying information for the device that triggered it.

Resolution

To identify the specific device that caused the failure:
1. Log in to the applicable ESXi host via SSH or KVM/physical console.
2. List vSAN disks using this command:

# esxcli vsan storage list

3. A failed disk appears in the output as an "Unknown" entry, identified only by its VSAN UUID:
Unknown:
   Device: Unknown
   Display Name: Unknown
   Is SSD: false
   VSAN UUID: 52703402-bfd2-9261-c40c-16d93dce226a
   VSAN Disk Group UUID:
   VSAN Disk Group Name:
   Used by this host: false
   In CMMDS: false
   On-disk format version: -1
   Deduplication: false
   Compression: false
   Checksum:
   Checksum OK: false
   Is Capacity Tier: false
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: false
   Creation Time: Unknown

4. You can also run vdq -iH to list the disk mappings on the host. If a disk is listed by its UUID instead of its device identifier, vSAN has failed that disk, as seen below:
[root@esx01:~] vdq -iH
Mappings:
   DiskMapping[0]:
           SSD:  naa.58ce38ee2031fec5
            MD:  naa.58ce38ee2019a7f9
            MD:  naa.58ce38ee201bbbd1
            MD:  naa.58ce38ee201b02a5
            MD:  naa.58ce38ee201b9d69
            MD:  naa.58ce38ee2019aaf5
            MD:  naa.58ce38ee2019a7e5
            MD:  52703402-bfd2-9261-c40c-16d93dce226a

5. To identify the device name of the failed disk, search the vmkernel log for its UUID. This only works if the failure is recent enough to still be in the log:
grep 52703402-bfd2-9261-c40c-16d93dce226a /var/log/vmkernel.log
You should see output similar to this:
2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD 52703402-bfd2-9261-c40c-16d93dce226a (naa.58ce38ee2063aad9:2)
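
The identification steps above can be sketched as a few small shell helpers. This is a minimal sketch, not an official tool: the function names (unknown_vsan_uuids, failed_vsan_uuids, device_for_uuid) are hypothetical, and the parsing assumes the output formats shown in the examples above and that awk, sed, grep and sort are available in the shell. Output formats may differ between ESXi releases.

```shell
#!/bin/sh
# Hypothetical helpers; each one reads command output on stdin.

# Step 3: print the VSAN UUID of every entry in `esxcli vsan storage list`
# output whose "Device:" field is "Unknown".
unknown_vsan_uuids() {
    awk '$1 == "Device:" { unknown = ($2 == "Unknown") }
         $1 == "VSAN" && $2 == "UUID:" && unknown { print $3 }'
}

# Step 4: print SSD/MD entries in `vdq -iH` output that are listed as a
# bare UUID instead of a device name such as naa.xxxx.
failed_vsan_uuids() {
    awk '($1 == "MD:" || $1 == "SSD:") &&
         $2 ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ { print $2 }'
}

# Step 5: print the device name logged next to the given UUID, assuming
# the "(<device>:<partition>)" suffix seen in the LSOM log line above.
device_for_uuid() {
    grep -F "$1" | sed -n 's/.*(\([^():]*\):[0-9]*).*/\1/p' | sort -u
}

# Example usage on the ESXi host (commands taken from the steps above):
#   esxcli vsan storage list | unknown_vsan_uuids
#   vdq -iH | failed_vsan_uuids
#   device_for_uuid 52703402-bfd2-9261-c40c-16d93dce226a < /var/log/vmkernel.log
```

If device_for_uuid prints nothing, the failure event has most likely rotated out of the current vmkernel.log.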
 
Note: Because the disk group is effectively lost, first remove the disk group with the option "No data migration", then replace the failed disk and re-create the disk group.


Additional Information

If necessary, we can get the path information about the failed device to further assist with identification.
From the ESXi Shell, run this command:

# esxcfg-mpath -bd <naa identifier of the device>

For example, the command and sample output for a device are:

# esxcfg-mpath -bd naa.6000c29c53fc02afe598901871729854
naa.6000c29c53fc02afe598901871729854 : VMware Serial Attached SCSI Disk (naa.6000c29c53fc02afe598901871729854)
vmhba1:C0:T1:L0 LUN:0 state:active sas Adapter: 5005056f7c188c11 Target: 5000c29c53fc02af


In the runtime name vmhba1:C0:T1:L0, T1 indicates that the device is target #1 on vmhba1.
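
To make the path notation explicit, the runtime name in the output above can be split into its parts. The following is an illustrative sketch only; parse_runtime_name is a hypothetical helper, and it assumes the usual adapter:Cchannel:Ttarget:Llun layout of the runtime name.

```shell
#!/bin/sh
# Hypothetical helper: split a runtime name such as vmhba1:C0:T1:L0
# into adapter, channel, target and LUN.
parse_runtime_name() {
    echo "$1" | awk -F: 'NF == 4 {
        # Drop the leading C/T/L letter from fields 2 to 4.
        printf "adapter=%s channel=%s target=%s lun=%s\n",
               $1, substr($2, 2), substr($3, 2), substr($4, 2)
    }'
}

# Example:
#   parse_runtime_name vmhba1:C0:T1:L0
#   prints: adapter=vmhba1 channel=0 target=1 lun=0
```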

We can also get the physical location of the device.
From the ESXi Shell, run these commands:

# esxcli storage core device physical get -d <naa identifier of the device>
# esxcli storage core device raid list -d <naa identifier of the device>


The commands and example output are:

# esxcli storage core device physical get -d naa.6000c29c53fc02afe5989018717291bb
 Physical Location: enclosure 2, slot 5


Or 

# esxcli storage core device raid list -d naa.6000c29c53fc02afe5989018717291bb
 Physical Location: enclosure 2, slot 5
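
When checking several devices, the location line can be reduced to just the enclosure and slot numbers. This is a small sketch that assumes the "Physical Location: enclosure N, slot M" format shown in the example output above; physical_slot is a hypothetical helper name, and controllers that report the location differently will produce no output.

```shell
#!/bin/sh
# Hypothetical helper: read command output on stdin and print the
# enclosure and slot from a "Physical Location:" line.
physical_slot() {
    sed -n 's/^ *Physical Location: *enclosure \([0-9]*\), *slot \([0-9]*\).*/enclosure=\1 slot=\2/p'
}

# Example usage on the ESXi host (command taken from the text above):
#   esxcli storage core device physical get -d <naa identifier of the device> | physical_slot
```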



See here for additional information:
Turn Locator LEDs on vSAN storage devices on/off
With Deduplication & Compression enabled: Adding or Removing Disks
Remove Disk Groups or Devices from vSAN
Working with Individual Devices


Translated versions of this article:
确定 vSAN 去重群集中的具体磁盘故障
vSAN 重複排除クラスタでの特定のディスク障害の特定