Non Disruptive Upgrade (NDU) of EMC XtremIO Arrays results in lost LUN connectivity

Products

VMware

Issue/Introduction

Symptoms:
In an NMP environment, during a Non Disruptive Upgrade (NDU) of EMC XtremIO Arrays, you experience these symptoms:

Connectivity to LUNs is lost, with All Paths Down (APD) and Permanent Device Loss (PDL) conditions.
In the /var/log/vmkernel.log file, you see messages showing TUR failing with a PDL on all paths to the device similar to:

2018-01-22T05:43:56.873Z cpu1:69146)WARNING: NMP: vmk_NmpSatpIssueTUR:1043: Device naa.514f0c5b0f800005 path vmhba3:C0:T5:L1 has been unmapped from the array
2018-01-22T05:43:56.873Z cpu2:69146)WARNING: NMP: vmk_NmpSatpIssueTUR:1043: Device naa.514f0c5b0f800005 path vmhba3:C0:T4:L1 has been unmapped from the array
TUR fails with PDL and all the paths are down, but the paths are not removed or marked as PDL. In the vobd.log file, you see messages showing that the device lost connectivity at an earlier time and should have been marked as PDL, similar to:

2018-01-22T05:29:54.930Z: [scsiCorrelator] 53521051781us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.514f0c5b0f800005 degraded. Path vmhba2:C0:T5:L1 is down. Affected datastores: "6.0U3_DS1".
2018-01-22T05:43:55.915Z: [scsiCorrelator] 54362036537us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.514f0c5b0f800005. Path vmhba2:C0:T4:L1 is down. Affected datastores: "6.0U3_DS1".
2018-01-22T05:43:55.915Z: [scsiCorrelator] 54362035359us: [vob.scsi.device.state.permanentloss] Device :naa.514f0c5b0f800005 has been removed or is permanently inaccessible.
2018-01-22T05:43:55.916Z: [scsiCorrelator] 54362036985us: [esx.problem.scsi.device.state.permanentloss] Device: naa.514f0c5b0f800005 has been removed or is permanently inaccessible. Affected datastores (if any): "6.0U3_DS1".

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
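To check whether a host shows the symptoms above, the two logs can be scanned for the same messages. The following is a minimal sketch, not a VMware tool: the function name summarize and both regular expressions are illustrative assumptions derived only from the sample lines shown above, so they may need adjusting if the message format differs in your environment.

```python
import re
from collections import defaultdict

# Both patterns are assumptions modeled on the sample log lines above.
# vmkernel.log: TUR reports that a path has been unmapped from the array.
TUR_UNMAPPED = re.compile(
    r"vmk_NmpSatpIssueTUR.*Device (?P<dev>naa\.[0-9a-f]+) "
    r"path (?P<path>vmhba\S+) has been unmapped"
)
# vobd.log: the esx.problem event that reports permanent device loss.
VOB_PDL = re.compile(
    r"\[esx\.problem\.scsi\.device\.state\.permanentloss\] "
    r"Device: ?(?P<dev>naa\.[0-9a-f]+)"
)


def summarize(lines):
    """Group matching log lines by device identifier.

    Returns {device: {"unmapped_paths": set of paths, "pdl_events": int}}.
    """
    summary = defaultdict(lambda: {"unmapped_paths": set(), "pdl_events": 0})
    for line in lines:
        match = TUR_UNMAPPED.search(line)
        if match:
            summary[match["dev"]]["unmapped_paths"].add(match["path"])
            continue
        match = VOB_PDL.search(line)
        if match:
            summary[match["dev"]]["pdl_events"] += 1
    return dict(summary)
```

Pass the function the lines read from the vmkernel.log and vobd.log files. A device that appears with several unmapped paths and at least one PDL event around the time of the NDU matches the pattern described in this article.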

Cause

This issue occurs because the storage path is marked as Permanent Device Loss (PDL) even before the TUR (from the Array) has marked the path as PDL.

A PDL is returned from either the target (array) or the firmware on the controller. Some changes/regressions might also cause a PDL to be returned, even if it was marked as a PDL from the Array/Controller. In this situation, any further I/O on the same path is FAST PDL'ed on the VMware layer as those paths would no longer be usable or active. This is exactly what happens during a Non Disruptive Upgrade (NDU).

Resolution

This issue is resolved in:

VMware ESXi 6.0 Patch Release ESXi600-201807001 (Build 9239799), available at VMware Patch Downloads.
VMware ESXi 6.5 Patch Release ESXi650-201811002 (Build 17477841), available at VMware Patch Downloads.
VMware ESXi 6.7 Patch Release ESXi670-201912001 (Build 15160138), available at VMware Patch Downloads.

For more information on downloading patches, see How to download patches in Customer Connect (1021623).

Workaround:
To work around this issue without upgrading, perform these steps during the NDU:

Upgrade Controller 1.
To get the dead paths active, run this command:

esxcfg-rescan -d

Note: If you encounter any issue with this command, use the esxcfg-rescan -A command instead.
Upgrade Controller 2.
To get the dead paths active, run this command:

esxcfg-rescan -d

Notes:

If you encounter any issue with this command, use the esxcfg-rescan -A command instead.
Run the rescan (esxcfg-rescan -d, or esxcfg-rescan -A as the fallback) after each controller upgrade, alternating between upgrading a controller and rescanning.
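The rescan step that follows each controller upgrade can be sketched as a small helper. This is an illustration under stated assumptions rather than a supported script: the function name rescan_dead_paths is hypothetical, and treating a non-zero exit code (or a failure to launch the command) as "any issue" with esxcfg-rescan -d is an interpretation, not something the article specifies. The two command lines themselves are the ones given in the steps above.

```python
import subprocess


def rescan_dead_paths(run=subprocess.run):
    """Try `esxcfg-rescan -d`; fall back to `esxcfg-rescan -A` on failure.

    `run` is injectable so the logic can be exercised off-host.
    Returns the list of commands that were attempted, in order.
    """
    attempted = []
    for flag in ("-d", "-A"):
        cmd = ["esxcfg-rescan", flag]
        attempted.append(cmd)
        try:
            result = run(cmd)
        except OSError:
            # Assumption: a launch failure counts as an issue; try the next flag.
            continue
        if result.returncode == 0:
            # Assumption: exit code 0 means the rescan succeeded.
            break
    return attempted
```

Call the helper once after Controller 1 is upgraded and again after Controller 2 is upgraded. The controller upgrades themselves are performed on the array and are outside the scope of this sketch.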

Additional Information

For more insight into TUR and PDL specific events, see: