Understanding the storage path failover sequence in VMware ESX/ESXi native multipathing (1027963)

Purpose

This article provides information on the VMware ESX/ESXi storage native multipathing failover sequence, as it is logged in /var/log/vmkernel on ESX Classic, and /var/log/messages on ESXi.

Note: This document pertains specifically to storage path failover as implemented in the VMware multipathing module, the Native Multipathing Plug-in (NMP). For information about third-party multipathing modules, refer to the vendor's documentation.

Resolution

 


Note: The example scenario in this article uses a software (S/W) iSCSI initiator and a LUN with the identifier naa.60060160d5c12200ccd66fd74a81de11.

The VMware ESX/ESXi storage multipathing failover sequence is:
  1. The connection along a given path is detected as down or offline. For example:

    vmkernel: 188:04:24:16.970 cpu8:4288)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:1 CN:0: iSCSI connection is being marked "OFFLINE"

  2. The ESX/ESXi host stops its iSCSI session. For example:

    vmkernel: 188:04:24:16.970 cpu8:4288)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000001 TARGET: iqn.1992-04.com.emc:cx.sl7e2091300074.b1 TPGT: 2 TSIH: 0]
    vmkernel: 188:04:24:16.970 cpu8:4288)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 192.168.2.16:50439 R: 192.168.2.8:3260]

  3. As a result of stopping that session, the iSCSI task is aborted. For example:

    vmkernel: 188:04:24:16.970 cpu11:4288)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:1 L:14 : Task mgmt "Abort Task" with itt=0x5155cba9 (refITT=0x5155cb93) timed out.

  4. The Native Multipathing Plug-in (NMP) detects a host status of 0x1 because the in-flight command failed. A host status of 0x1 translates to NO_CONNECT. For details, see SCSI events that can trigger ESX server to fail a LUN over to another path (1003433). For example:

    vmkernel: 188:04:24:16.970 cpu1:4286)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41000716a200) to NMP device "naa.60060160d5c12200ccd66fd74a81de11" failed on physical path "vmhba33:C0:T1:L7" H:0x1 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.

  5. After the NMP receives this host status, it sends a TEST_UNIT_READY (TUR) command down that path to confirm that the path is down before initiating a failover. For example:

    vmkernel: 188:04:24:16.970 cpu1:4286)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.60060160d5c12200ccd66fd74a81de11": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

  6. If this command also fails, the ESX/ESXi host's Path Selection Policy (PSP) activates the next path for the device (LUN). For example:

    vmkernel: 188:04:24:16.989 cpu1:4131)vmw_psp_mru: psp_mruSelectPathToActivateInt: Changing active path from vmhba33:C0:T1:L7 to vmhba33:C0:T0:L7 for device "naa.60060160d5c12200ccd66fd74a81de11".

  7. The log line in step 6 indicates that the path change was successful. The NMP then retries the queued commands down the new path so that they complete despite the failover. For example:

    vmkernel: 188:04:24:17.974 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60060160d5c12200ccd66fd74a81de11" - issuing command 0x41000716a200

  8. The initial commands may not complete immediately after failover (for example, if the LUN still has pending reservations). The ESX/ESXi host sends a LUN reset if there is a pending SCSI reservation against the device or LUN. The reset breaks the SCSI-2 reservation held by the previous initiator, so that the ESX/ESXi host can resume I/O after failover. For example:

    vmkernel: 188:04:24:17.974 cpu12:4108)WARNING: NMP: nmp_CompleteRetryForPath: Retry command 0x28 (0x41000716a200) to NMP device "naa.60060160d5c12200ccd66fd74a81de11" failed on physical path "vmhba33:C0:T0:L7" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0

    This translates to:

    Host Status = 0x0 = OK
    Device Status = 0x2 = Check Condition
    Plugin Status = 0x0 = OK
    Sense Key = 0x6 = UNIT ATTENTION
    Additional Sense Code/ASC Qualifier = 0x29/0x0 = POWER ON OR RESET OCCURRED


  9. At this stage, the ESX/ESXi host can retry the next command in the queue:

    Sep 10 13:11:18 laesx01 vmkernel: 188:04:24:17.974 cpu12:4108)WARNING: NMP: nmp_CompleteRetryForPath: Retry world on with device "naa.60060160d5c12200ccd66fd74a81de11" - retry the next command in retry queue
    Sep 10 13:11:18 laesx01 vmkernel: 188:04:24:17.974 cpu12:4108)ScsiDeviceIO: 747: Command 0x28 to device "naa.60060160d5c12200ccd66fd74a81de11" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
    Sep 10 13:11:18 laesx01 vmkernel: 188:04:24:17.974 cpu11:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60060160d5c12200ccd66fd74a81de11" - issuing command 0x41000706fa00

  10. When the path failover succeeds and commands complete on the new path, the log shows a line similar to:

    vmkernel: 188:04:24:17.975 cpu12:4108)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.60060160d5c12200ccd66fd74a81de11"

  11. Finally, because this example uses a S/W iSCSI initiator, the iSCSI connection is also marked "ONLINE" again:

    vmkernel: 188:04:24:20.405 cpu9:4288)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba33:CH:0 T:1 CN:0: iSCSI connection is being marked "ONLINE"
    vmkernel: 188:04:24:20.405 cpu9:4288)WARNING: iscsi_vmk: iscsivmk_StartConnection: Sess [ISID: 00023d000001 TARGET: iqn.1992-04.com.emc:cx.sl7e2091300074.b1 TPGT: 2 TSIH: 0]
    vmkernel: 188:04:24:20.405 cpu9:4288)WARNING: iscsi_vmk: iscsivmk_StartConnection: Conn [CID: 0 L: 192.168.2.16:52160 R: 192.168.2.8:3260]
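
The stamps in the example lines above give a rough measure of how long I/O was blocked during the failover. The following is a minimal Python sketch, not part of any VMware tooling. It assumes that the stamp after "vmkernel:" is host uptime in days:hours:minutes:seconds.milliseconds; the names UPTIME_RE and uptime_ms are hypothetical. The two input lines are copied from steps 1 and 10.

```python
import re

# Assumption: the stamp after "vmkernel:" is host uptime written as
# days:hours:minutes:seconds.milliseconds.
UPTIME_RE = re.compile(r"vmkernel: (\d+):(\d{2}):(\d{2}):(\d{2})\.(\d{3})")


def uptime_ms(line):
    """Return the uptime stamp of a vmkernel log line in milliseconds, or None."""
    match = UPTIME_RE.search(line)
    if match is None:
        return None
    days, hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return (((days * 24 + hours) * 60 + minutes) * 60 + seconds) * 1000 + millis


# Step 1: the connection is marked OFFLINE.
offline = ('vmkernel: 188:04:24:16.970 cpu8:4288)WARNING: iscsi_vmk: '
           'iscsivmk_StopConnection: vmhba33:CH:0 T:1 CN:0: '
           'iSCSI connection is being marked "OFFLINE"')
# Step 10: the retry world reports the device as recovered.
recovered = ('vmkernel: 188:04:24:17.975 cpu12:4108)NMP: '
             'nmp_CompleteRetryForPath: Retry world recovered device '
             '"naa.60060160d5c12200ccd66fd74a81de11"')

elapsed = uptime_ms(recovered) - uptime_ms(offline)
print(f"I/O recovered {elapsed} ms after the path went offline")
```

For the example in this article, the difference between the two stamps is 1005 ms.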

Note: The storage stack handles failover identically for Fibre Channel (FC), so this sequence also applies to FC, with the exception of steps 1, 2, 3, and 11, which are specific to iSCSI.
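
When reading a long log, it can help to label each line with the step of the sequence it belongs to. The following Python sketch is one possible way to do that, assuming the log wording matches the examples in this article (other builds or Path Selection Policies may word these lines differently). The STAGES table and the classify function are hypothetical names, and the iSCSI-only flag reflects steps 1, 2, 3, and 11.

```python
# (step number, substring that identifies the log line, iSCSI-only step?)
# Assumption: the substrings are taken from the example lines in this
# article; they are not an official list of vmkernel messages.
STAGES = [
    (1, 'marked "OFFLINE"', True),
    (2, "iscsivmk_StopConnection", True),    # checked after the OFFLINE marker
    (3, "iscsivmk_TaskMgmtIssue", True),
    (4, "nmp_CompleteCommandForPath", False),
    (5, "nmp_DeviceRetryCommand", False),
    (6, "Changing active path", False),
    (7, "nmp_DeviceAttemptFailover", False),
    (8, "nmp_CompleteRetryForPath: Retry command", False),
    (9, "retry the next command in retry queue", False),
    (10, "Retry world recovered device", False),
    (11, "iscsivmk_StartConnection", True),
]


def classify(line):
    """Return (step, iscsi_only) for the first matching marker, else None."""
    for step, marker, iscsi_only in STAGES:
        if marker in line:
            return step, iscsi_only
    return None


# Lines copied from steps 1, 6, and 10 of this article.
sample = [
    'vmkernel: 188:04:24:16.970 cpu8:4288)WARNING: iscsi_vmk: '
    'iscsivmk_StopConnection: vmhba33:CH:0 T:1 CN:0: '
    'iSCSI connection is being marked "OFFLINE"',
    'vmkernel: 188:04:24:16.989 cpu1:4131)vmw_psp_mru: '
    'psp_mruSelectPathToActivateInt: Changing active path from '
    'vmhba33:C0:T1:L7 to vmhba33:C0:T0:L7 for device '
    '"naa.60060160d5c12200ccd66fd74a81de11".',
    'vmkernel: 188:04:24:17.975 cpu12:4108)NMP: nmp_CompleteRetryForPath: '
    'Retry world recovered device "naa.60060160d5c12200ccd66fd74a81de11"',
]

labels = [classify(line) for line in sample]
# Steps that would also be expected on an FC path (iSCSI-only steps dropped).
fc_steps = [step for step, iscsi_only in labels if not iscsi_only]
print(labels)
print(fc_steps)
```

The table is ordered so that the more specific marker for step 1 is tested before the broader function-name marker for step 2.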

Additional Information

Other SCSI events and codes can trigger path failover. For more information, see SCSI events that can trigger ESX server to fail a LUN over to another path (1003433).
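
The status fields in the NMP lines can be decoded mechanically. The following Python sketch covers only the codes that this article names, plus device status 0x0 (GOOD); any other code is reported as unknown. The lookup tables, STATUS_RE, and the decode function are hypothetical names for illustration. Sense data is decoded only when the line reports it as "Valid".

```python
import re

# Only codes named in this article are listed (plus device status GOOD).
HOST_STATUS = {0x0: "OK", 0x1: "NO_CONNECT"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
PLUGIN_STATUS = {0x0: "OK"}
SENSE_KEY = {0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x29, 0x0): "POWER ON OR RESET OCCURRED"}

STATUS_RE = re.compile(
    r"H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+) "
    r"(Valid|Possible) sense data: (0x[0-9a-f]+) (0x[0-9a-f]+) (0x[0-9a-f]+)",
    re.IGNORECASE,
)


def name(table, code):
    """Look up a code, falling back to a hex placeholder."""
    return table.get(code, f"unknown ({code:#x})")


def decode(line):
    """Decode the H/D/P status and sense data of an NMP log line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(match.group(i), 16) for i in (1, 2, 3))
    key, asc, ascq = (int(match.group(i), 16) for i in (5, 6, 7))
    valid = match.group(4).lower() == "valid"
    result = {
        "host": name(HOST_STATUS, host),
        "device": name(DEVICE_STATUS, device),
        "plugin": name(PLUGIN_STATUS, plugin),
        "sense_valid": valid,
    }
    if valid:  # "Possible sense data" is not decoded
        result["sense_key"] = name(SENSE_KEY, key)
        result["asc_ascq"] = ASC_ASCQ.get(
            (asc, ascq), f"unknown ({asc:#x}/{ascq:#x})")
    return result


# Status portions copied from the step 4 and step 8 example lines.
first_failure = ('failed on physical path "vmhba33:C0:T1:L7" '
                 'H:0x1 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.')
retry_failure = ('failed on physical path "vmhba33:C0:T0:L7" '
                 'H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0')

print(decode(first_failure))
print(decode(retry_failure))
```

For a fuller mapping of codes, the tables would need to be extended from the SCSI specifications or from article 1003433.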

Update History

09/27/2010 - Added note about the storage stack handling failover identically for FC.
04/05/2013 - Added ESXi 5.1.x to Products.
01/05/2016 - Added 6.0 to product list. Removed references to "4.x and 5.0".
