Active/Passive or ALUA based storage devices see APD events during storage controller fail-over on ESXi 6.7 & 7.0 hosts

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
ESXi 6.7 & 7.0 hosts with active/passive or ALUA-based storage devices may see premature APD events during storage controller fail-over scenarios.

Purpose:
If you are using ESXi 6.7 or 7.0 hosts with an ALUA-based storage array managed by the VMware NMP stack (NFS arrays are excluded), it is recommended to disable action_OnRetryErrors.

Environment

VMware vSphere ESXi 7.x
VMware vSphere ESXi 6.7

Resolution

Use one of these methods to resolve the issue:

Disable OnRetryErrors using an SATP claim rule (Preferred method)
Disable OnRetryErrors for existing devices (quick fix - non-preferred method)

Disable OnRetryErrors using an SATP claim rule (Preferred method)

Create an SATP claim rule to change the default behavior of action_OnRetryErrors to "off". This setting must be applied to every ESXi 6.7 and 7.0 host that has LUNs from an ALUA-based storage array mapped to it.

Add the claimrule with an option to disable OnRetryErrors.

esxcli storage nmp satp rule add -V COMPELNT -P VMW_PSP_RR -s VMW_SATP_ALUA -o disable_action_OnRetryErrors

Reload the claim rules to enforce the change:

esxcli storage core claimrule load

Note: The change will take effect immediately for any new LUNs presented but requires a reboot in order to reclaim existing storage devices with the new ruleset.

List the claimrule table to confirm the changes are there:

esxcli storage nmp satp rule list |grep -i comp

Example output:

VMW_SATP_ALUA COMPELNT disable_action_OnRetryErrors user VMW_PSP_RR
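When this is scripted across several hosts, it can help to test whether the rule already exists before adding it again. The following is a minimal sketch, not part of the documented procedure: rule_present is a hypothetical helper name, and it assumes that awk is available in the shell and that the rule list prints the SATP name in the first column, as in the example output above.

```shell
# Return success (0) if the rule list read from stdin contains a VMW_SATP_ALUA
# rule for the given vendor that carries disable_action_OnRetryErrors.
# Hypothetical helper; $1 = vendor string exactly as shown in the rule list.
rule_present() {
    vendor=$1
    awk -v v="$vendor" '
        # Only consider ALUA rules that carry the disable option.
        $1 == "VMW_SATP_ALUA" && /disable_action_OnRetryErrors/ {
            # The vendor column position can vary, so scan the remaining fields.
            for (i = 2; i <= NF; i++) if ($i == v) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}
```

On a host this might be used as: esxcli storage nmp satp rule list | rule_present COMPELNT && echo "rule already present"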

To verify the setting against a device, capture the naa id of the device and run this command:

esxcli storage nmp device list | grep -A2 naa.deviceIDhere

For example:

Device Display Name: COMPELNT Fibre Channel Disk (naa.60000000000000000000000000000000)

Storage Array Type: VMW_SATP_ALUA

Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=off; {TPG_id=61445,TPG_state=AO}{TPG_id=61446,TPG_state=AO}}
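To check every device at once rather than one naa ID at a time, the device list can be filtered for entries that still report action_OnRetryErrors=on. This is a rough sketch under stated assumptions: list_retry_on is a hypothetical helper name, awk is assumed to be available in the shell, and each device is assumed to keep its default display name with the naa ID in parentheses, as in the example above.

```shell
# Print the naa ID of every device whose SATP device config still shows
# action_OnRetryErrors=on. Reads "esxcli storage nmp device list" output on stdin.
# Hypothetical helper; assumes the display name ends with "(naa.<hex id>)".
list_retry_on() {
    awk '
        # Remember the device ID from the most recent "Device Display Name" line.
        /Device Display Name:/ {
            id = ""
            if (match($0, /\(naa\.[0-9a-fA-F]+\)/))
                id = substr($0, RSTART + 1, RLENGTH - 2)
        }
        # Report the device when its SATP config line has the setting enabled.
        /Storage Array Type Device Config:/ && /action_OnRetryErrors=on/ {
            if (id != "") print id
        }
    '
}
```

On a host this might be used as: esxcli storage nmp device list | list_retry_on. Empty output would suggest that no device matching these assumptions still has the setting enabled.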

Disable OnRetryErrors for existing devices (quick fix - non-preferred method)

Disable this setting for existing devices on a live host without a reboot.

Run the following to list all devices from the ALUA-based storage array and validate the current setting.

esxcli storage nmp device list | grep -A2 COMPELNT

Example output:

Device Display Name: COMPELNT Fibre Channel Disk (naa.60000000000000000000000000000000)

Storage Array Type: VMW_SATP_ALUA

Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=on; {TPG_id=61445,TPG_state=AO}{TPG_id=61446,TPG_state=AO}}

Extract the naa device names where appropriate and run the following to change OnRetryErrors to "off" on a per-device per-host basis.

esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d naa.xxx

Once all devices are configured, rerun the esxcli storage nmp device list command above to validate that action_OnRetryErrors now shows "off".
Repeat this procedure on the remaining hosts.

Note: No reboot is required for these changes to take effect. Additionally, the SATP claim rule workaround will overwrite this one upon reboot.
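With many devices, typing the per-device command by hand is error-prone. The sketch below only generates the commands so they can be reviewed before anything is changed. gen_disable_cmds is a hypothetical helper name, and the input is assumed to be one naa ID per line.

```shell
# Emit one "deviceconfig set" command per naa ID read from stdin (one ID per line).
# Nothing is executed here; the output is meant to be reviewed first.
# Hypothetical helper, not an esxcli feature.
gen_disable_cmds() {
    while IFS= read -r dev; do
        case "$dev" in
            naa.*)
                printf 'esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d %s\n' "$dev"
                ;;
            *)
                # Ignore blank lines and anything that is not an naa ID.
                ;;
        esac
    done
}
```

For example, gen_disable_cmds < devices.txt prints the commands for the IDs in devices.txt (a hypothetical file name). Once reviewed, the same output could be piped to sh to apply the change on that host.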

Additional Information

A feature called "action_OnRetryErrors" was introduced at the storage array type plugin (SATP) level in ESXi 6.0 to enable automatic path failover when health probing on a storage path fails. The feature is disabled by default in ESXi 6.0 and 6.5. In ESXi 6.7, the default SATP setting for "action_OnRetryErrors" was changed from "off" to "on" for all presentation models (EXPLICIT ALUA ONLY/IMPLICIT ALUA ONLY/EXPLICIT AND IMPLICIT ALUA). Because this setting caused issues with IMPLICIT ALUA ONLY targets, in ESXi 6.7P01 (ESXi670-201912001) and later versions "action_OnRetryErrors" is disabled for IMPLICIT ALUA ONLY targets and remains enabled for the other types.

This change is detectable at the device level by running the following command:

esxcli storage nmp device list

Example output:
   Device Display Name: COMPELNT Fibre Channel Disk (naa.60000000000000000000000000000000)
   Storage Array Type: VMW_SATP_ALUA
   Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=on; {TPG_id=61445,TPG_state=AO}{TPG_id=61446,TPG_state=AO}} <==
   Path Selection Policy: VMW_PSP_MRU
   Path Selection Policy Device Config: Current Path=vmhba3:C0:T1:L1
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba3:C0:T1:L1
   Is USB: false

Before we dig into how this change impacts failover behavior, we first need to understand how a path is marked as "dead" or "unavailable". See Understanding how paths to a storage/LUN device are marked as Dead.

action_OnRetryErrors

This setting determines how ESXi reacts to a VMK_STORAGE_RETRY_OPERATION response received from a failed path health probing attempt. If the "action_OnRetryErrors" setting is enabled and a VMK_STORAGE_RETRY_OPERATION is received in response to a failed health probing attempt, the path is marked dead and a failover is initiated after a fixed number of retries. If the setting is disabled, ESXi treats VMK_STORAGE_RETRY_OPERATION as transient and retries the IO on the same path.
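The decision described above can be pictured with a toy model. This is purely illustrative and is not ESXi code: path_action is a hypothetical function, and the retry limit is passed in as a parameter because the actual fixed number of retries is not stated here.

```shell
# Toy model of the behavior described above; not how ESXi is implemented.
# $1 = action_OnRetryErrors setting ("on" or "off")
# $2 = number of VMK_STORAGE_RETRY_OPERATION responses seen so far on the path
# $3 = hypothetical retry limit (the real fixed value is not given here)
path_action() {
    setting=$1
    retries=$2
    limit=$3
    if [ "$setting" = "on" ] && [ "$retries" -ge "$limit" ]; then
        # Enabled and the limit is reached: path is marked dead, failover starts.
        echo "mark-dead-and-failover"
    else
        # Disabled, or limit not yet reached: the error is treated as transient.
        echo "retry-same-path"
    fi
}
```

With the setting "off", the model returns retry-same-path regardless of the retry count, which is the behavior that the workarounds in this article restore.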

Impact

If another path is available to receive IO, this change in failover behavior has no impact. However, with some ALUA-based arrays such as Dell EMC SC Series ("Compellent"), or with general active/passive array configurations, it can be problematic during specific failover scenarios, such as a controller failover during a firmware upgrade or general controller failover testing, because it may take several seconds for the remaining controller to take ownership of the devices previously handled by the degraded controller. Because the path is failed over so quickly, there may be no path available on which the IO can resume, resulting in a premature APD scenario.