ESXi host takes a long time to start during rescan of RDM LUNs


Article ID: 318851


Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
When RDMs are used as shared disk resources for a clustering solution such as WSFC, Red Hat High Availability Cluster, etc., you experience these symptoms:
  • ESXi hosts hosting secondary nodes may take a long time to start. The delay depends on the number of RDMs attached to the ESXi host.

    Note: For example, in a system with 10 RDMs used in a two-node WSFC or Red Hat High Availability cluster, restarting the ESXi host with the secondary node may take up to 30 minutes. With fewer RDMs, the restart time is shorter; with only three RDMs, for example, the restart takes approximately 10 minutes.
  • The ESXi host intermittently displays an error message on the Summary tab, and the vSphere Client may not be able to start:

    Cannot synchronize host hostname. Operation Timed out.
     
  • During startup, the host's console appears to stop after a message similar to:

    Loading module multiextent.


Environment

VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.0.0
VMware vSphere ESXi 8.0

Cause

This issue occurs when virtual machines participating in a clustering solution such as WSFC or Red Hat High Availability Cluster use shared RDMs with SCSI reservations across hosts, and a virtual machine on another host is the active cluster node holding a SCSI reservation. During boot, the restarting ESXi host attempts to interrogate each LUN it discovers; commands sent to a LUN locked by another node's reservation fail and must time out, and this timeout is incurred for every reserved RDM.

The delay occurs at these steps:

Starting path claiming and SCSI device discovery

In the /var/log/vmkernel.log file of the restarting ESXi host, you see entries similar to:

vmkernel: 0:00:01:57.828 cpu0:4096)WARNING: ScsiCore: 1353: Power-on Reset occurred on naa.6006016045502500176a24d34fbbdf11
vmkernel: 0:00:01:57.830 cpu0:4096)VMNIX: VmkDev: 2122: Added SCSI device vml0:3:0 (naa.6006016045502500166a24d34fbbdf11)
vmkernel: 0:00:02:37.842 cpu3:4099)ScsiDeviceIO: 1672: Command 0x1a to device "naa.6006016045502500176a24d34fbbdf11" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0

Mounting the partition of the RDM LUNs

In the /var/log/vmkernel.log file of the restarting ESXi host, you see entries similar to:

vmkernel: 0:00:08:58.811 cpu2:4098)WARNING: ScsiCore: 1353: Power-on Reset occurred on naa.600601604550250083489d914fbbdf11
vmkernel: 0:00:08:58.814 cpu0:4096)VMNIX: VmkDev: 2122: Added SCSI device vml0:9:0 (naa.600601604550250082489d914fbbdf11)
vmkernel: 0:00:09:38.855 cpu2:4098)ScsiDeviceIO: 1672: Command 0x1a to device "naa.600601604550250083489d914fbbdf11" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
vmkernel: 0:00:09:38.855 cpu1:4111)ScsiDeviceIO: 4494: Could not detect setting of QErr for device naa.600601604550250083489d914fbbdf11. Error Failure.
vmkernel: 0:00:10:08.945 cpu1:4111)WARNING: Partition: 801: Partition table read from device naa.600601604550250083489d914fbbdf11 failed: I/O error
vmkernel: 0:00:10:08.945 cpu1:4111)ScsiDevice: 2200: Successfully registered device "naa.600601604550250083489d914fbbdf11" from plugin "NMP" of type 0
vmkernel: 47:02:52:19.382 cpu17:9624)WARNING: NMP: nmp_IsSupportedPResvCommand: Unsupported Persistent Reservation Command, service action 0 type 4
vmkernel: 47:02:52:19.383 cpu12:4108)WARNING: NMP: nmpUpdatePResvStateSuccess: Parameter List Length 54310000 for service action 0 is beyond the supported value 18
vmkernel: 47:02:52:21.383 cpu23:9621)WARNING: NMP: nmp_IsSupportedPResvCommand: Unsupported Persistent Reservation Command, service action 0 type 4

If you configure the setting on an existing VMFS LUN, you may see these entries in the /var/log/vmkernel.log file:

cpu4:10169)WARNING: Partition: 1273: Device "naa.XXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxx" with a VMFS partition is marked perennially reserved. This is not supported and may lead to data loss.
You can safely ignore this warning for Clustered VMDK datastores. VMware Engineering is working on suppressing the message in a future release.
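
To confirm from the ESXi shell that reserved LUNs are causing the delay, you can search the vmkernel log for the messages shown above. This is a minimal sketch, assuming the default log location of /var/log/vmkernel.log; the patterns simply match the example entries in this article:

# Count command failures and reservation warnings logged during boot
grep -cE "failed H:0x5|Persistent Reservation" /var/log/vmkernel.log

# List the devices affected by the failing commands
grep "failed H:0x5" /var/log/vmkernel.log | grep -o "naa\.[0-9a-f]*" | sort -u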

Resolution

ESXi 6.x and ESXi 7.x Hosts

For ESXi 6.x and 7.x hosts, the command-line procedure for marking the RDMs as perennially reserved is described below. A scripted shell sketch and a PowerCLI sketch for applying the setting to many devices and hosts follow the procedure.

To mark the LUNs as perennially reserved:
  1. Determine which RDM LUNs are part of the WSFC, Red Hat High Availability Cluster, or other clustering solution. From the vSphere Client, select a virtual machine that has a mapping to the clustered RDM devices.
  2. Edit your virtual machine settings and navigate to the Mapped RAW LUNs; in this example, Hard disk 2.

     
  3. The Physical LUN field shows the device in use as the RDM, identified by its VML ID.

    Take note of the VML ID, which is a globally unique identifier for your shared device.
     
  4. Identify the naa.id for this VML ID by running this command: esxcli storage core device list

    For example:

    esxcli storage core device list

    naa.6589cfc000000a17ac02aae02067e747
       Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000a17ac02aae02067e747)
       Has Settable Display Name: true
       Size: 40960
       Device Type: Direct-Access
       Multipath Plugin: NMP
       Devfs Path: /vmfs/devices/disks/naa.6589cfc000000a17ac02aae02067e747
       Vendor: FreeNAS
       Model: iSCSI Disk
       Revision: 0123
       SCSI Level: 6

     
  5.    Is Pseudo: false
       Status: degraded
       Is RDM Capable: true
       Is Local: false
       Is Removable: false
       Is SSD: false
       Is VVOL PE: false
       Is Offline: false
       Is Perennially Reserved: false
       Queue Full Sample Size: 0
       Queue Full Threshold: 0
       Thin Provisioning Status: unknown
       Attached Filters:
       VAAI Status: supported
       Other UIDs: vml.010001000030303530353630313031303830310000695343534920
       Is Shared Clusterwide: true
       Is SAS: false
       Is USB: false
       Is Boot Device: false
       Device Max Queue Depth: 128
       No of outstanding IOs with competing worlds: 32
       Drive Type: unknown
       RAID Level: unknown
       Number of Physical Drives: unknown
       Protection Enabled: false
       PI Activated: false
       PI Type: 0
       PI Protection Mask: NO PROTECTION
       Supported Guard Types: NO GUARD SUPPORT
       DIX Enabled: false
       DIX Guard Type: NO GUARD SUPPORT
       Emulated DIX/DIF Enabled: false
  5. Use the esxcli command to mark the device as perennially reserved:

    esxcli storage core device setconfig -d naa.id --perennially-reserved=true

    For example:

    esxcli storage core device setconfig -d naa.6589cfc000000a17ac02aae02067e747 --perennially-reserved=true

    Note: For vSphere 7.x, see the Change Perennial Reservation Settings section of the vSphere Storage Guide.
     
  6. To verify that the device is perennially reserved, run this command:

    esxcli storage core device list -d naa.id

    In the output, look for the entry Is Perennially Reserved: true, which confirms that the device is marked as perennially reserved.

    For example:

    esxcli storage core device list -d naa.6589cfc000000a17ac02aae02067e747

    naa.6589cfc000000a17ac02aae02067e747
       Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000a17ac02aae02067e747)
       Has Settable Display Name: true
       Size: 40960
       Device Type: Direct-Access
       Multipath Plugin: NMP
       Devfs Path: /vmfs/devices/disks/naa.6589cfc000000a17ac02aae02067e747
       Vendor: FreeNAS
       Model: iSCSI Disk
       Revision: 0123
       SCSI Level: 6
       Is Pseudo: false
       Status: degraded
       Is RDM Capable: true
       Is Local: false
       Is Removable: false
       Is SSD: false
       Is VVOL PE: false
       Is Offline: false
       Is Perennially Reserved: true
       Queue Full Sample Size: 0
       Queue Full Threshold: 0
       Thin Provisioning Status: unknown
       Attached Filters:
       VAAI Status: supported
       Other UIDs: vml.010001000030303530353630313031303830310000695343534920
       Is Shared Clusterwide: true
       Is SAS: false
       Is USB: false
       Is Boot Device: false
       Device Max Queue Depth: 128
       No of outstanding IOs with competing worlds: 32
       Drive Type: unknown
       RAID Level: unknown
       Number of Physical Drives: unknown
       Protection Enabled: false
       PI Activated: false
       PI Type: 0
       PI Protection Mask: NO PROTECTION
       Supported Guard Types: NO GUARD SUPPORT
       DIX Enabled: false
       DIX Guard Type: NO GUARD SUPPORT
       Emulated DIX/DIF Enabled: false

     
  7. Repeat the procedure for each Mapped RAW LUN that participates in the clustering solution, such as WSFC or Red Hat High Availability Cluster. To handle many devices at once, see the scripted sketch after this procedure.

    Note: The configuration is permanently stored with the ESXi host and persists across restarts. To remove the perennially reserved flag, run this command:

    esxcli storage core device setconfig -d naa.id --perennially-reserved=false
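
    If many RDMs are involved, steps 3 through 6 can be scripted from the ESXi shell. The following is a minimal sketch rather than an official procedure: it assumes that the vml.* entries under /vmfs/devices/disks are symbolic links to their backing naa.* devices, and the VML IDs shown are placeholders for the values noted in step 3:

    # Replace the placeholder VML IDs with the ones noted from the VM settings
    for VML in vml.EXAMPLE1 vml.EXAMPLE2; do
        # Resolve the vml symlink to its backing naa device name
        DEV=$(basename "$(readlink "/vmfs/devices/disks/$VML")")
        echo "Marking $DEV (backing $VML) as perennially reserved"
        esxcli storage core device setconfig -d "$DEV" --perennially-reserved=true
        # Confirm the flag was applied
        esxcli storage core device list -d "$DEV" | grep "Is Perennially Reserved"
    done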

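The perennially reserved setting is stored per host, so it must be applied on every ESXi host that can access the shared LUN. The following PowerCLI sketch applies the setting across a cluster; it is a hedged example, assuming an existing Connect-VIServer session, and the cluster name and naa.id are placeholders for your environment:

# Placeholders: replace the cluster name and device ID with your own values
$naaId = "naa.6589cfc000000a17ac02aae02067e747"
foreach ($esxHost in Get-Cluster "MyCluster" | Get-VMHost) {
    # Get-EsxCli -V2 exposes the same esxcli namespace used in the steps above
    $esxcli = Get-EsxCli -VMHost $esxHost -V2
    $arguments = $esxcli.storage.core.device.setconfig.CreateArgs()
    $arguments.device = $naaId
    $arguments.perenniallyreserved = $true
    $esxcli.storage.core.device.setconfig.Invoke($arguments)
}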

Additional Information

VMware Skyline Health Diagnostics for vSphere - FAQ
How to detach a LUN device from ESXi hosts

For more information, see Obtaining LUN pathing information for ESX or ESXi hosts (1003973).
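
For a quick check of the paths to a single device, you can run this from the ESXi shell (substitute your own naa.id; the device ID below is the example used earlier in this article):

esxcli storage core path list -d naa.6589cfc000000a17ac02aae02067e747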

Note: The PowerCLI and esxcli commands are case sensitive. If the naa.id is specified in uppercase letters when issuing the command, a new device is added on the ESXi host.

The resolution steps in this article are also known to resolve storage devices reporting NMP errors similar to:
 
WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.600601604ec0360065efeed9d265e411": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

If you experience the symptoms described above with Clustered VMDK datastores, follow the same steps to resolve the issue.