Understanding lost access to volume messages in ESXi 6.x/7.x

Products

VMware vSphere ESXi

Issue/Introduction

This article explains the Lost access to volume messages that ESXi reports for VMFS datastores.

ESXi hosts monitor VMFS datastores through heartbeats, which each host issues as write operations to the VMFS volume approximately once every 3 seconds. Each ESXi host accessing a VMFS datastore expects a heartbeat write I/O to complete within an 8-second window. If the heartbeat I/O does not complete within 8 seconds, the I/O is timed out and a subsequent heartbeat I/O is issued. If no heartbeat I/O completes within a total of 16 seconds, the datastore is marked offline and hostd generates a Lost access to volume log message to reflect this behavior.

After a VMFS datastore is marked offline, ESXi issues heartbeat I/O to the datastore approximately once every second until connectivity is restored. When a heartbeat I/O completes, the datastore is marked back online and host I/O is allowed to continue.
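The timing described above can be summarized in a small model. This is only an illustrative sketch built from the 8-second and 16-second values in this article; the names classify and HeartbeatState are hypothetical, the handling of the exact boundary values is an assumption, and the code does not represent how ESXi implements heartbeating internally.

```python
from enum import Enum

# Thresholds taken from the description in this article.
HEARTBEAT_IO_TIMEOUT_S = 8   # a single heartbeat I/O is timed out after 8 seconds
OFFLINE_THRESHOLD_S = 16     # the datastore is marked offline after 16 seconds in total


class HeartbeatState(Enum):
    ONLINE = "online"
    RETRYING = "heartbeat I/O timed out, subsequent heartbeat issued"
    OFFLINE = "offline (Lost access to volume)"


def classify(seconds_without_completed_heartbeat: float) -> HeartbeatState:
    """Map the time elapsed without a completed heartbeat to a state.

    Hypothetical helper: treating exactly 8 s and 16 s as already
    timed out is an assumption made for this sketch.
    """
    if seconds_without_completed_heartbeat >= OFFLINE_THRESHOLD_S:
        return HeartbeatState.OFFLINE
    if seconds_without_completed_heartbeat >= HEARTBEAT_IO_TIMEOUT_S:
        return HeartbeatState.RETRYING
    return HeartbeatState.ONLINE
```

For example, classify(2) returns HeartbeatState.ONLINE, classify(10) returns HeartbeatState.RETRYING, and classify(20) returns HeartbeatState.OFFLINE.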

Symptoms:

Virtual machines display as inaccessible.
In the /var/log/hostd.log file, you see entries similar to:

2015-07-02T02:00:11.675Z [4F1E1B70 info 'Vimsvc.ha-eventmgr'] Event 205 : Lost access to volume 54f89e21-4427e506-b968-a0369f519998 (example datastore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2015-07-02T02:00:37.055Z [4F480B70 info 'Vimsvc.ha-eventmgr'] Event 210 : Successfully restored access to volume 54f89e21-4427e506-b968-a0369f519998 (example datastore) following connectivity issues.

In the /var/log/vobd.log file, you see entries similar to:

2015-07-02T02:00:11.673Z: [vmfsCorrelator] 115715089142us: [esx.problem.vmfs.heartbeat.timedout] 54f89e21-4427e506-b968-a0369f519998 example datastore
2015-07-02T02:00:37.054Z: [vmfsCorrelator] 115740470730us: [esx.problem.vmfs.heartbeat.recovered] 54f89e21-4427e506-b968-a0369f519998 example datastore

In the /var/log/vmkernel.log file, you see entries similar to:

2015-07-02T02:00:11.282Z cpu10:36273)HBX: 2832: Waiting for timed out [HB state abcdef02 offset 3444736 gen 549 stampUS 115704005679 uuid 5592d754-21d7d8a7-0a7e-a0369f519998 jrnl <FB 779600> drv 14.60] on vol 'example datastore'
2015-07-02T02:00:37.054Z cpu26:32873)HBX: 258: Reclaimed heartbeat for volume 54f89e21-4427e506-b968-a0369f519998 (example datastore): [Timeout] Offset 3444736

In vCenter Server, you see an event similar to:

Lost access to volume 54f89e21-4427e506-b968-a0369f519998 (example datastore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
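To measure how long access to a volume was lost, the timedout and recovered entries in vobd.log can be paired by volume UUID. The following Python sketch assumes the entries follow the format of the vobd.log excerpt above; the function and variable names are hypothetical, and real logs may contain variations that this pattern does not cover.

```python
import re
from datetime import datetime

# Pattern based on the vobd.log excerpt in this article (an assumption
# about the general format; adjust it if your entries differ).
VOBD_HEARTBEAT = re.compile(
    r"^(?P<ts>\S+Z): \[vmfsCorrelator\] \d+us: "
    r"\[esx\.problem\.vmfs\.heartbeat\.(?P<event>timedout|recovered)\] "
    r"(?P<uuid>\S+) (?P<name>.*)$"
)


def parse_ts(text):
    """Parse a timestamp such as 2015-07-02T02:00:11.673Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")


def heartbeat_outages(lines):
    """Return (uuid, name, start, end, seconds) for each timedout/recovered pair."""
    pending = {}   # volume UUID -> time of the first unrecovered timeout
    outages = []
    for line in lines:
        match = VOBD_HEARTBEAT.match(line.strip())
        if not match:
            continue
        ts = parse_ts(match["ts"])
        uuid = match["uuid"]
        if match["event"] == "timedout":
            pending.setdefault(uuid, ts)
        elif uuid in pending:
            start = pending.pop(uuid)
            seconds = (ts - start).total_seconds()
            outages.append((uuid, match["name"], start, ts, seconds))
    return outages


# The two vobd.log entries shown in this article.
SAMPLE_LINES = [
    "2015-07-02T02:00:11.673Z: [vmfsCorrelator] 115715089142us: "
    "[esx.problem.vmfs.heartbeat.timedout] "
    "54f89e21-4427e506-b968-a0369f519998 example datastore",
    "2015-07-02T02:00:37.054Z: [vmfsCorrelator] 115740470730us: "
    "[esx.problem.vmfs.heartbeat.recovered] "
    "54f89e21-4427e506-b968-a0369f519998 example datastore",
]
```

Calling heartbeat_outages(SAMPLE_LINES) yields a single outage of about 25.4 seconds for the example datastore. To analyze a real host, pass the lines of a copy of /var/log/vobd.log instead of SAMPLE_LINES.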

Environment

VMware vSphere ESXi 5.5
VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.0.0
VMware vSphere ESXi 8.0

Resolution

To determine why the heartbeat I/O operations did not complete in time:

Note the date/time when the lost access to volume message was reported and check the ESXi host logs for related information.
Verify that there are no connectivity issues between the ESXi host and the storage device.

Troubleshooting network connectivity issues depends on how your storage is connected. For more information, see Troubleshooting LUN connectivity issues on ESXi hosts (1003955).
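When checking the ESXi host logs around the time of the message, it can help to extract only the entries close to the event. The following Python sketch assumes each log line starts with a timestamp in the format shown in the excerpts in this article; the helper names line_timestamp and lines_near and the 60-second default window are hypothetical choices, not part of any VMware tool.

```python
from datetime import datetime, timedelta


def line_timestamp(line):
    """Parse the leading timestamp of a log line, or return None if absent."""
    # vobd.log entries end the timestamp with a colon; strip it if present.
    token = line.split(" ", 1)[0].rstrip(":")
    try:
        return datetime.strptime(token, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None


def lines_near(lines, event_time, window_seconds=60):
    """Return the log lines whose timestamp is within the window of event_time."""
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in lines:
        ts = line_timestamp(line)
        if ts is not None and abs(ts - event_time) <= window:
            selected.append(line)
    return selected
```

For example, with event_time set to datetime(2015, 7, 2, 2, 0, 11, 675000), the time of the Lost access to volume entry in the hostd.log excerpt, passing the lines of a copy of /var/log/vmkernel.log returns only the entries logged within one minute of the event.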

Additional Information

VMware Skyline Health Diagnostics for vSphere - FAQ

While a volume is in the lost access to volume state, host I/O is blocked until a heartbeat I/O completes. After the first heartbeat timeout occurs, ESXi issues heartbeat reclaim operations to the datastore, approximately once every second, until the heartbeat is recovered. Until the heartbeat is reclaimed, VMFS fails all I/O operations from virtual machines residing on the impacted datastore with a DEVICE BUSY status. A guest operating system should remain online as long as it can sustain the long latency of these I/O operations to the VMDK. For more information, see: