All Paths Down for a storage device

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms

The All Paths Down (APD) timeout for a storage device connected to your ESXi host expires without the device recovering from the APD state.
The ESXi host shows as Disconnected/Not responding in vCenter Server
You are unable to connect directly to the ESXi host using the vSphere Client
In the /var/log/vmkernel.log file, you see entries similar to:
cpu1:2049)WARNING: NMP: nmp_IssueCommandToDevice:2954:I/O could not be issued to device "naa.60a98000572d54724a34642d71325763" due to Not found
cpu1:2049)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60a98000572d54724a34642d71325763": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
cpu1:2049)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60a98000572d54724a34642d71325763" is blocked. Not starting I/O from device.
cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60a98000572d54724a34642d71325763" - issuing command 0x4124007ba7c0
cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.60a98000572d54724a34642d71325763" - failed to issue command due to Not found (APD), try again...
cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.60a98000572d54724a34642d71325763": awaiting fast path state update...
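
When the log contains many entries like these, it can help to summarize which devices the NMP warnings name. The following is a minimal Python sketch, not a VMware tool: the function name count_nmp_warnings and the regular expression are illustrative assumptions based only on the line format shown above, and the script is meant to run on a copy of the log on a workstation.

```python
import re
from collections import Counter

# Assumption: the device ID appears as a quoted NAA identifier after the
# word "device", as in the example lines shown in this article.
DEVICE_RE = re.compile(r'device "(naa\.[0-9a-f]+)"', re.IGNORECASE)


def count_nmp_warnings(lines):
    """Return a Counter mapping device ID -> number of NMP WARNING lines naming it."""
    counts = Counter()
    for line in lines:
        # Only consider NMP warnings; skip every other vmkernel message.
        if "WARNING: NMP:" not in line:
            continue
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two of the example lines from this article, used as sample input.
SAMPLE = [
    'cpu1:2049)WARNING: NMP: nmp_IssueCommandToDevice:2954:I/O could not be '
    'issued to device "naa.60a98000572d54724a34642d71325763" due to Not found',
    'cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device '
    '"naa.60a98000572d54724a34642d71325763": awaiting fast path state update...',
]

for device, hits in count_nmp_warnings(SAMPLE).items():
    print(device, hits)
```

To use it on a real log, pass an open file object instead of SAMPLE, for example a local copy of /var/log/vmkernel.log.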

In the /var/log/vobd.log file, you see an error similar to:

YYYY-MM-DD T00:26:51.504Z: [APDCorrelator] 2682686563317us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [11ace9d3-7bebe4e8] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
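
The identifier and the elapsed time can be pulled out of an event like this one with a short script. This is a hedged Python sketch: the function name parse_apd_timeout and the pattern are assumptions derived from the single example line above, so other ESXi builds may word the event differently.

```python
import re

# Assumption: the event text follows the wording of the example in this
# article: "... identifier [<id>] ... for <N> seconds."
APD_TIMEOUT_RE = re.compile(
    r"\[esx\.problem\.storage\.apd\.timeout\].*?"
    r"identifier \[(?P<identifier>[^\]]+)\].*?"
    r"for (?P<seconds>\d+) seconds"
)


def parse_apd_timeout(line):
    """Return (identifier, seconds) for an APD timeout event, or None if the line is not one."""
    match = APD_TIMEOUT_RE.search(line)
    if match is None:
        return None
    return match.group("identifier"), int(match.group("seconds"))


# The example event from this article, used as sample input.
SAMPLE_EVENT = (
    "YYYY-MM-DD T00:26:51.504Z: [APDCorrelator] 2682686563317us: "
    "[esx.problem.storage.apd.timeout] Device or filesystem with identifier "
    "[11ace9d3-7bebe4e8] has entered the All Paths Down Timeout state after "
    "being in the All Paths Down state for 140 seconds. "
    "I/Os will now be fast failed."
)

print(parse_apd_timeout(SAMPLE_EVENT))
```

Lines that do not contain the esx.problem.storage.apd.timeout event return None, so the function can be applied to every line of the file.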

You receive the following event message when a storage device connected to your ESXi host enters the All Paths Down (APD) state:

T13:41:33.250z cpu4:8598)StorageApdHandler: 692: APD Handle Created with lock.

Notes:

This log message indicates that the system had an APD event, but does not mean that it is currently in the APD state. This message is also seen when the host boots.
The messages indicate that the system has turned on a timer that allows your ESXi host to continue retrying attempts to re-establish connectivity with the device for a limited time period.
By default, the APD timeout is set to 140 seconds.
The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Cause

An All Paths Down (APD) situation occurs when all paths to a device are down. Because there is no indication of whether the device loss is permanent or temporary, the ESXi host keeps trying to re-establish connectivity. APD situations commonly occur when the LUN is incorrectly unpresented from the ESXi host.

The timeout period begins when the storage device becomes unavailable to your ESXi host and enters the APD state. By default, the APD timeout is set to 140 seconds. While the timeout lasts, the host continues its attempts to reestablish connectivity with the device. When the timeout ends and the device does not recover, the host stops its attempts to retry any I/O that is not coming from virtual machines.
The reasons for an APD state can be, for example, a failed switch or a disconnected storage cable.
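
The timeout behavior described above can be illustrated with a small model. This Python sketch is only a simplified illustration of the documented behavior, not ESXi code: the class ApdDevice, its method names, and the choice to treat the exact 140-second mark as already timed out are all assumptions made for the example.

```python
from dataclasses import dataclass

DEFAULT_APD_TIMEOUT_S = 140  # default APD timeout stated in this article


@dataclass
class ApdDevice:
    """Hypothetical model of a device that entered the APD state."""

    apd_started_at: float                      # seconds, any monotonic clock
    timeout_s: float = DEFAULT_APD_TIMEOUT_S

    def timed_out(self, now: float) -> bool:
        # Assumption: the boundary itself counts as timed out.
        return now - self.apd_started_at >= self.timeout_s

    def handle_io(self, now: float, from_vm: bool) -> str:
        """Return 'retry' or 'fast-fail' for an I/O issued at time `now`."""
        # Virtual machine I/O keeps being retried, even after the timeout.
        if from_vm:
            return "retry"
        # Other I/O is retried only while the timeout is still running.
        return "fast-fail" if self.timed_out(now) else "retry"


device = ApdDevice(apd_started_at=0.0)
print(device.handle_io(now=100.0, from_vm=False))  # within the timeout window
print(device.handle_io(now=200.0, from_vm=False))  # after the timeout
print(device.handle_io(now=200.0, from_vm=True))   # virtual machine I/O
```

The model ignores recovery: if the device becomes reachable again, the APD state ends and none of this applies.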

Impact

The device and the datastores on the device become unavailable. Virtual machine I/O will continue to be retried.
This affects the management agents, because their commands receive no response until the device is accessible again. As a result, the ESXi host becomes inaccessible or shows as Not responding in vCenter Server.

Environment

VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.0.0
VMware vSphere ESXi 8.0

Resolution

Due to the nature of an APD situation, there is no clean way to recover.

The APD situation needs to be resolved at the storage array/fabric layer to restore connectivity to the host.
All affected ESXi hosts may require a reboot to remove any residual references to the affected devices that are in an APD state.
To resolve this issue, identify the cause of the disconnected LUNs by reviewing the environment, including the storage array, the SAN switches, and any failed devices.

If the virtual machines on the datastores remain responsive, you can power off the virtual machines or migrate them to a different datastore or host.

Additional Information

For information on block storage devices, see Path Redundancy to Storage Device Degraded.
For information on NFS storage devices, see Troubleshooting NFS datastore connectivity issues.
For more information on other APD event messages, see:
- Storage device has recovered from the APD state

A storage device is considered to be in the APD state when it becomes unavailable to your ESXi host for an unspecified period of time. In contrast with the permanent device loss (PDL) state, the host treats the APD state as transient and expects the device to be available again. For more information, see Handling Transient APD Conditions.