SCSI events that can trigger ESX server to fail a LUN over to another path

Products

VMware

Issue/Introduction

Certain SCSI events trigger a path failover on the ESX host.
The SCSI events and the resulting failover events are logged in the file /var/log/vmkernel.

Resolution

Several conditions cause the ESX host to fail over to another available path:

0/1 0x0 0x0 0x0 - DID_NO_CONNECT

When the fabric returns the DID_NO_CONNECT status, the ESX host detects that a target is no longer present. The DID_NO_CONNECT status can be caused by a fabric switch failure, a disconnected physical cable, or a zoning change that no longer allows the ESX host to see the array.

Note: This is the only sense code that affects both Active/Active and Active/Passive arrays. The other sense codes only trigger failover on Active/Passive arrays.

2/0 0x3 0x4 0x3 - MEDIUM ERROR - LOGICAL UNIT NOT READY

or

2/0 0x5 0x4 0x3 - ILLEGAL REQUEST - LOGICAL UNIT NOT READY

The MEDIUM ERROR and ILLEGAL REQUEST codes both indicate that the LUN is not in a ready state. Manual intervention is required on the array to correct this issue.

2/0 0x2 0x4 0xa - NOT_READY - LUN IS NOT READY AND TARGET PORT IS IN TRANSITION

The "LUN is not ready and target port is in transition" error can occur under these conditions:
- The ownership of a LUN is transitioned between storage processors on an active/passive array.
- The LUN is in a transitional state. For example, when a LUN is created for the first time.

0/7 0x0 0x0 0x0 - INTERNAL ERROR - DID_ERROR (Storage Initiator Error)

A new failover condition was introduced in ESX 3.5 that allows the host to recognize when an EMC Clariion storage processor (SP) hangs and to issue additional commands to verify its status. When this SCSI code is captured, the ESX host queries the peer SP to see if the original one is alive. If the peer SP cannot get a response, a failover is initiated and the SP is marked as hung/dead.

Note: It is possible to see the 0/7 0x0 0x0 0x0 code for other arrays; however, this does not necessarily mean that the storage controller is offline or not functioning. A storage initiator error can be seen for a number of reasons. If these messages are causing an issue in your environment, investigate them as a separate issue with our support team. In the vmkernel.log file, you see entries similar to:

H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0
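
Entries in this format can be split into their fields with a short script. The following is a minimal sketch, not a VMware tool. It assumes that H, D, and P stand for the host, device, and plugin status, and that the three values after "Valid sense data" are the sense key, ASC, and ASCQ. The names `ScsiStatus` and `parse_status` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Matches fragments such as: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0
# The sense data part is optional because not every entry carries it.
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+)\s+D:(0x[0-9a-fA-F]+)\s+P:(0x[0-9a-fA-F]+)"
    r"(?:\s+Valid sense data:\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+))?"
)


@dataclass(frozen=True)
class ScsiStatus:
    host: int      # assumed meaning of H
    device: int    # assumed meaning of D
    plugin: int    # assumed meaning of P
    sense: Optional[Tuple[int, int, int]]  # (sense key, ASC, ASCQ), if present


def parse_status(line: str) -> Optional[ScsiStatus]:
    """Return the status fields found in a log line, or None if there are none."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    host, device, plugin = (int(m.group(i), 16) for i in (1, 2, 3))
    sense = None
    if m.group(4) is not None:
        sense = tuple(int(m.group(i), 16) for i in (4, 5, 6))
    return ScsiStatus(host, device, plugin, sense)
```

For example, `parse_status("H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0")` yields a host status of 0, a device status of 2, and the sense triple (0x6, 0x29, 0x0).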

Note: There is a known issue with some Emulex firmware which results in this host code being returned. For more information, see When using Emulex HBAs, SCSI commands fail with the status: Storage Initiator Error (1029456).

2/0 0x5 0x94 0x1 - ILLEGAL REQUEST - SCSI_ASC_INVALID_REQ_DUE_TO_CURRENT_LU_OWNERSHIP

This code is specific to LSI based arrays (IBM FAStT, IBM DS4000 series, SUN StorageTek) and implies that a request was made to the non-owning storage controller. Since AVT (Auto-Volume Transfer) is disabled, the ESX host handles the condition and fails over to the other controller.

2/0 0x5 0x25 0x0 - ILLEGAL REQUEST - LOGICAL UNIT NOT SUPPORTED

ESXi 5.0 introduced a new sense code to deal with a Permanent Device Loss (PDL) condition. LOGICAL UNIT NOT SUPPORTED is usually set when a LUN is no longer available or is unmapped.

Notes:
- Path Failover is triggered only if other paths to the LUN do not return this sense code. The device is marked as PDL after all paths to the LUN return this sense code.
- A problem with this process was identified in ESXi 6.0, where failover is not triggered even when other paths to the LUN are available. This issue is resolved in ESXi 6.0 Update 2, available at VMware Downloads. For more information, see Storage PDL responses may not trigger path failover in vSphere 6.0 (2144657).

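
The two outcomes described in the first note can be sketched as a small decision function. This is an illustrative model only, not ESXi code. The function name `pdl_action`, the return strings, and the path names in the usage example are hypothetical. Only the sense triple 0x5 0x25 0x0 comes from this article.

```python
from typing import Mapping, Optional, Tuple

Sense = Tuple[int, int, int]  # (sense key, ASC, ASCQ)

# ILLEGAL REQUEST - LOGICAL UNIT NOT SUPPORTED, as listed in this article.
PDL_SENSE: Sense = (0x5, 0x25, 0x0)


def pdl_action(path_senses: Mapping[str, Optional[Sense]]) -> str:
    """Decide what to do given the latest sense data seen on each path.

    path_senses maps a path name to the sense triple it returned,
    or to None if the path returned no sense data.
    """
    pdl_paths = [p for p, s in path_senses.items() if s == PDL_SENSE]
    if not pdl_paths:
        return "none"        # no path reported the PDL sense code
    if len(pdl_paths) == len(path_senses):
        return "mark_pdl"    # every path reported it: device is in PDL
    return "failover"        # at least one other path is still usable
```

With one path returning 0x5 0x25 0x0 and a second path returning no sense data, `pdl_action` returns "failover". With both paths returning the code, it returns "mark_pdl".
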
2/0 0x2 0x4 0x1 - NOT READY - LOGICAL UNIT IS IN PROCESS OF BECOMING READY

IBM FAStT LSI based arrays use this sense code during a Non-Disruptive Firmware Upgrade to tell the host to stop using each SP in turn.

If the ESX host receives a SCSI code other than those listed above, a failover does not occur.

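
The codes listed above can be collected into a lookup table for checking log output. The following is a minimal sketch. It assumes that the notation used in this article reads as "device status/host status, sense key, ASC, ASCQ", which is an interpretation of the listed values rather than something the article states. The names `FAILOVER_CODES`, `parse_code`, and `failover_reason` are hypothetical, and the ALUA-specific codes are not included.

```python
from typing import Dict, Optional, Tuple

# (device status, host status, sense key, ASC, ASCQ) -- assumed field order.
Code = Tuple[int, int, int, int, int]

# Codes and descriptions copied from the list in this article.
FAILOVER_CODES: Dict[Code, str] = {
    (0, 1, 0x0, 0x00, 0x0): "DID_NO_CONNECT",
    (2, 0, 0x3, 0x04, 0x3): "MEDIUM ERROR - LOGICAL UNIT NOT READY",
    (2, 0, 0x5, 0x04, 0x3): "ILLEGAL REQUEST - LOGICAL UNIT NOT READY",
    (2, 0, 0x2, 0x04, 0xA): "NOT_READY - LUN IS NOT READY AND TARGET PORT IS IN TRANSITION",
    (0, 7, 0x0, 0x00, 0x0): "INTERNAL ERROR - DID_ERROR (Storage Initiator Error)",
    (2, 0, 0x5, 0x94, 0x1): "ILLEGAL REQUEST - SCSI_ASC_INVALID_REQ_DUE_TO_CURRENT_LU_OWNERSHIP",
    (2, 0, 0x5, 0x25, 0x0): "ILLEGAL REQUEST - LOGICAL UNIT NOT SUPPORTED",
    (2, 0, 0x2, 0x04, 0x1): "NOT READY - LOGICAL UNIT IS IN PROCESS OF BECOMING READY",
}


def parse_code(text: str) -> Code:
    """Parse the first four fields of a string such as '2/0 0x2 0x4 0xa'."""
    status, key, asc, ascq = text.split()[:4]
    device, host = status.split("/")
    return (int(device, 16), int(host, 16),
            int(key, 16), int(asc, 16), int(ascq, 16))


def failover_reason(text: str) -> Optional[str]:
    """Return the article's description if the code is in the list, else None."""
    return FAILOVER_CODES.get(parse_code(text))
```

For example, `failover_reason("0/1 0x0 0x0 0x0")` returns "DID_NO_CONNECT", while a code that is not in the list, such as "2/0 0x6 0x29 0x0", returns None.
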
2/0 0x6 0x2a 0x6 - UNIT ATTENTION - ASYMMETRIC ACCESS STATE CHANGED

2/0 0x2 0x4 0xb - NOT READY - LOGICAL UNIT NOT ACCESSIBLE, TARGET PORT IN STANDBY STATE

These sense codes are specific to ALUA environments configured with alua_failover enabled. For more information about ALUA environments, see:

Article ID: 345174
