Issuing a 0x85 SCSI command from a VMware ESXi 6.0 host with the EMC XtremIO storage array may result in a PDL error

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
When a VMware vSphere ESXi 6.0 host requests SMART data from an EMC XtremIO storage array, the array may return a response that triggers a Permanent Device Loss (PDL) condition on the host.

In the /var/log/vmkernel.log file on the ESXi Host, you see entries similar to:

2015-07-23T20:34:05.108Z cpu2:33198)WARNING: NMP: nmp_PathDetermineFailure:2872: Cmd (0x85) PDL error (0x5/0x25/0x0) - path vmhba4:C0:T0:L10 device naa.514f0c514ba0000e - triggering path evaluation
2015-07-23T20:34:05.108Z cpu2:33198)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x85 (0x439e16768f40, 34616) to dev "naa.514f0c514ba0000e" on path "vmhba4:C0:T0:L10" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:EVAL
2015-07-23T20:34:05.108Z cpu2:33198)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.514f0c514ba0000e" state in doubt; requested fast path state update...
2015-07-23T20:34:05.108Z cpu2:33198)ScsiDeviceIO: 2646: Cmd(0x439e16768f40) 0x85, CmdSN 0x385c from world 34616 to dev "naa.514f0c514ba0000e" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.

The failing command in every one of these entries is:

Cmd(0x85)

The storage array returns this exact response to the 0x85 command:

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0
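
The signature above can be searched for mechanically. The following is a minimal sketch, assuming the ScsiDeviceIO line format shown in the excerpts; the function name find_smart_pdl_failures and the regular expression are illustrative assumptions, not part of any VMware tool.

```python
import re

# Hypothetical pattern modeled on the ScsiDeviceIO excerpts in this article.
SCSI_FAIL = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<opcode>0x[0-9a-f]+), .*?'
    r'to dev "(?P<device>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-f]+) D:(?P<dev>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+) '
    r'(?:Valid|Possible) sense data: '
    r'(?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)'
)

# The combination described in this article: command 0x85, device status 0x2,
# sense data 0x5 0x25 0x0.
SIGNATURE = ("0x85", "0x2", "0x5", "0x25", "0x0")


def find_smart_pdl_failures(lines):
    """Return the devices (in first-seen order) that logged the signature."""
    devices = []
    for line in lines:
        m = SCSI_FAIL.search(line)
        if not m:
            continue
        found = (m["opcode"], m["dev"], m["key"], m["asc"], m["ascq"])
        if found == SIGNATURE and m["device"] not in devices:
            devices.append(m["device"])
    return devices
```

Passing the lines of a copy of /var/log/vmkernel.log to this function returns the list of affected device identifiers, or an empty list if the signature does not appear.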

In ESXi 6.0 Update 2, the preceding messages may also be accompanied by these symptoms:
- Widespread IO timeouts and subsequent command aborts, logged with the H:0x5 host status code
  
  For example:
  
  2016-03-10T20:36:12.203Z cpu2:33199)ScsiDeviceIO: 2646: Cmd(0x439e05768e50) 0x28, CmdSN 0x379d from world 34527 to dev "naa.514f0c514ba0000e" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
  
  Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
- Hosts may take a long time to reconnect to vCenter after reboot or hosts may enter a Not Responding state in vCenter Server
- Storage-related tasks such as HBA rescan may take a very long time to complete
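
To gauge how widespread the H:0x5 failures are, the log entries can be tallied per device. This is a rough sketch under the assumption that the entries follow the ScsiDeviceIO format of the example line; count_host_status_5 is a hypothetical helper name.

```python
import re
from collections import Counter

# Hypothetical pattern: device name and host status from a "failed H:..." entry.
HOST_STATUS = re.compile(
    r'to dev "(?P<device>[^"]+)" failed H:(?P<host>0x[0-9a-f]+) '
)


def count_host_status_5(lines):
    """Count entries per device whose host status is H:0x5."""
    counts = Counter()
    for line in lines:
        m = HOST_STATUS.search(line)
        if m and m["host"] == "0x5":
            counts[m["device"]] += 1
    return counts
```

A device with a large count relative to the others is a reasonable place to start looking, although the count alone does not prove that the 0x85 response is the cause.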

Environment

VMware vSphere ESXi 6.0

Cause

This issue occurs because, in this specific scenario, an ESXi host has sent a request for SMART data to a storage array, and the array has responded with an unexpected illegal request error. The response received by the host triggers Permanent Device Loss (PDL) detection, and the kernel performs a path evaluation to determine whether it needs to fail the path in question.
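
The numeric fields in the log entries map onto standard SCSI definitions: opcode 0x85 is ATA PASS-THROUGH (16), device status 0x2 is CHECK CONDITION, sense key 0x5 is ILLEGAL REQUEST, and additional sense code 0x25 with qualifier 0x0 is LOGICAL UNIT NOT SUPPORTED. The sketch below decodes only the values that appear in this article; the function name describe and the lookup tables are illustrative and deliberately incomplete.

```python
# Partial lookup tables: only the values that appear in this article.
OPCODES = {0x85: "ATA PASS-THROUGH (16)", 0x28: "READ(10)"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x0: "NO SENSE", 0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x25, 0x00): "LOGICAL UNIT NOT SUPPORTED"}


def _lookup(table, key, shown):
    """Return the name for key, or a placeholder for values not in the table."""
    return table.get(key, f"unknown ({shown})")


def describe(opcode, device_status, sense_key, asc, ascq):
    """Build a one-line description of a failed command (hypothetical helper)."""
    op = _lookup(OPCODES, opcode, hex(opcode))
    status = _lookup(DEVICE_STATUS, device_status, hex(device_status))
    key = _lookup(SENSE_KEYS, sense_key, hex(sense_key))
    extra = _lookup(ASC_ASCQ, (asc, ascq), f"{hex(asc)}/{hex(ascq)}")
    return f"{op} -> {status}: {key}, {extra}"
```

Calling describe(0x85, 0x2, 0x5, 0x25, 0x0) yields a readable form of the response that the host interprets as a PDL indication.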

In ESXi 6.0 Update 2, a change to the PDL response behavior can cause this condition to block additional IO operations, resulting in the command failures and timeouts described in the Symptoms section. For more information, see the General Storage Issues section in the ESXi 6.0 Update 2 Release Notes.

Resolution

This is a known issue affecting VMware ESXi 6.0 with EMC XtremIO storage arrays.

This is a firmware issue on the storage array. Contact the array vendor to find out whether a fixed firmware version is available.

To work around these issues, use one of these options:

Note: VMware recommends that you apply one of these workarounds prior to upgrading your ESXi hosts to ESXi 6.0 Update 2.

Option 1

Disable the SMART daemon (smartd). However, this affects local data capture of SMART data for internal drives.

Note: VMware recommends against disabling smartd if possible.

To stop and disable smartd on an ESXi Host:

1. Connect to the ESXi host through SSH or a local console session using root credentials. For more information, see Using ESXi Shell in ESXi 5.x and 6.0 (2004746).
2. Stop the smartd service using this command:

   /etc/init.d/smartd stop

3. Disable the service using this command:

   chkconfig smartd off

Option 2

Depending on the array type used in the environment, there may be a firmware update available from the manufacturer that prevents the PDL sense code from being returned in response to the SMART command.

VMware recommends that you engage with your array vendor to determine if there is a firmware update that can be applied to prevent this behavior.

Notes:

While there may be other applicable storage platforms, this issue is known to be present on certain firmware versions of the EMC XtremIO storage array. This issue is known to be resolved in firmware version 4.0.2-80 (or later) for the XtremIO storage array.
Note: The preceding link was correct as of June 29, 2016. If you find the link is broken, provide feedback and a VMware employee will update the link.
As with all storage platforms, contact your array vendor for a final assessment of any given behavior or fixes in a firmware release.
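
Checking an installed firmware level against the 4.0.2-80 threshold can be sketched as follows. This assumes that version strings consist of integers separated by dots and a dash and that they compare numerically field by field; that ordering is an assumption, so confirm it with the array vendor. The names parse_version and has_fix are hypothetical.

```python
import re

# Version named in this article as containing the fix.
FIXED_VERSION = "4.0.2-80"


def parse_version(text):
    """Split a string such as '4.0.2-80' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", text.strip()))


def has_fix(installed):
    """Return True if installed is at or above FIXED_VERSION.

    Assumes numeric, field-by-field ordering of the version components.
    """
    return parse_version(installed) >= parse_version(FIXED_VERSION)
```

The installed version string itself must be obtained from the array's own management interface; this sketch only performs the comparison.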

Additional Information

For more information on Permanent Device Loss, its detection and causes, see:

Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x (2004684).
The Detecting PDL Conditions section in the vSphere 6.0 Storage guide.

For more information about SCSI sense code errors, see Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x/6.0 (1030381).

Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x/6.0
Changing the Disk.MaxLUN parameter on ESXi Hosts
Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
Using ESXi Shell in ESXi 5.x and 6.x