vSAN -- DDH -- Disk Groups show as unmounted in the vSphere Web Client

Products

VMware vSAN

Issue/Introduction

The purpose of this article is to provide information regarding vSAN disk monitoring and proactive health analysis.

Note: vSAN monitors solid state drive and magnetic disk drive health and proactively isolates unhealthy devices by unmounting them. It detects the gradual failure of a vSAN disk and isolates the device before congestion builds up within the affected host and the entire vSAN cluster. An alarm is generated from each host whenever an unhealthy device is detected, and an event is generated if an unhealthy device is automatically unmounted.

Symptoms:
When running VMware vSAN (formerly known as Virtual SAN), you experience one or more of these symptoms:

In the vSphere Web Client, under Disk Management for the vSAN cluster, vSAN disk groups or disk group members appear as unmounted and are greyed out.
Virtual machines and objects stored in the vSAN cluster become unavailable.
When you run esxcli vsan storage list from the ESXi shell, disks may report: In CMMDS: false
In the /var/log/vmkernel.log file on the ESXi hosts, you see entries similar to:

For an SSD:
VSAN Device Monitor: WARNING - READ Average Latency on VSAN device eui.2114aa100d00001 has exceeded threshold value 50 ms 1 times.
VSAN Device Monitor: Unmounting VSAN diskgroup eui.2114aa100d00001

For an MDD:
VSAN Device Monitor: WARNING - READ Average Latency on VSAN device naa.5000cca080002020 has exceeded threshold value 500 ms 1 times.
VSAN Device Monitor: Unmounting VSAN device naa.5000cca080002020
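
The log entries above can be picked out of a vmkernel.log file with a short script. The following is a minimal sketch, not a VMware tool: the function name scan_vmkernel_log and both regular expressions are illustrative assumptions modelled only on the sample messages shown above, and real log lines carry extra prefixes (timestamps and so on) that the patterns simply skip over.

```python
import re

# Patterns modelled on the sample messages in this article (assumption:
# other builds may word these messages differently).
LATENCY_RE = re.compile(
    r"VSAN Device Monitor: WARNING - (?P<op>\w+) Average Latency on VSAN device "
    r"(?P<device>\S+) has exceeded threshold value (?P<threshold_ms>\d+) ms "
    r"(?P<count>\d+) times"
)
UNMOUNT_RE = re.compile(
    r"VSAN Device Monitor: Unmounting VSAN (?P<kind>diskgroup|device) (?P<device>\S+)"
)


def scan_vmkernel_log(lines):
    """Return (latency_warnings, unmounts) found in an iterable of log lines."""
    warnings, unmounts = [], []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            warnings.append({
                "op": match.group("op"),
                "device": match.group("device"),
                "threshold_ms": int(match.group("threshold_ms")),
                "count": int(match.group("count")),
            })
            continue
        match = UNMOUNT_RE.search(line)
        if match:
            unmounts.append({
                "kind": match.group("kind"),
                "device": match.group("device"),
            })
    return warnings, unmounts


# The four sample lines from this article.
SAMPLE = [
    "VSAN Device Monitor: WARNING - READ Average Latency on VSAN device "
    "eui.2114aa100d00001 has exceeded threshold value 50 ms 1 times.",
    "VSAN Device Monitor: Unmounting VSAN diskgroup eui.2114aa100d00001",
    "VSAN Device Monitor: WARNING - READ Average Latency on VSAN device "
    "naa.5000cca080002020 has exceeded threshold value 500 ms 1 times.",
    "VSAN Device Monitor: Unmounting VSAN device naa.5000cca080002020",
]

if __name__ == "__main__":
    _, found = scan_vmkernel_log(SAMPLE)
    for entry in found:
        print(entry["kind"], entry["device"])
```

To scan a real log, pass an open file object instead of SAMPLE, for example a copy of /var/log/vmkernel.log taken from the host.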

Environment

VMware vSAN 8.0.x
VMware vSAN 7.0.x
VMware vSAN 6.7.x
VMware vSAN 6.6.x
VMware vSAN 6.5.x
VMware vSAN 6.2.x
VMware vSAN 6.1.x
VMware vSAN 6.0.x
VMware vSAN 6.x
VMware vSAN 5.5.x

Cause

vSAN can unmount a disk group member disk for several reasons. One possibility is that vSAN has proactively unmounted the disk in response to conditions that have the potential to impact overall cluster health, a process known as dying disk handling (DDH). This could indicate a failed or failing disk and should be investigated with VMware Support.

Some conditions that may result in a vSAN Disk Group or disk group member being proactively unmounted are:

A significant period of high latency is detected on a Solid-State Drive (SSD)
A significant period of high latency is detected on a Magnetic Disk Drive (MDD)

Note: If a problem is detected in the SSD cache disk, the entire affected disk group is unmounted. If there is an issue with a capacity disk, just that single affected magnetic disk is unmounted.
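
The scope rule in the note above can be expressed as a small decision function. This is only an illustrative model of the described behavior, not vSAN code: the DiskGroup type, the devices_to_unmount function, and the pairing of the two example devices into one disk group are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiskGroup:
    """Hypothetical model: one cache device fronting some capacity devices."""
    cache_device: str
    capacity_devices: List[str]


def devices_to_unmount(group: DiskGroup, failing_device: str) -> List[str]:
    """Return the devices affected when failing_device is found unhealthy."""
    if failing_device == group.cache_device:
        # Problem in the cache disk: the entire disk group is unmounted.
        return [group.cache_device] + list(group.capacity_devices)
    if failing_device in group.capacity_devices:
        # Problem in a capacity disk: only that single disk is unmounted.
        return [failing_device]
    raise ValueError(f"{failing_device} is not a member of this disk group")


if __name__ == "__main__":
    # Device names borrowed from the sample log entries; grouping them
    # together here is an assumption made for the example.
    group = DiskGroup("eui.2114aa100d00001", ["naa.5000cca080002020"])
    print(devices_to_unmount(group, "naa.5000cca080002020"))
    print(devices_to_unmount(group, "eui.2114aa100d00001"))
```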

This behavior may be observed on vSAN clusters that use hardware not listed in the VMware Virtual SAN Compatibility Guide (vSAN VCG). For a supported configuration, all hardware used in a vSAN cluster must be listed in the vSAN Compatibility Guide and marked as compatible with the version of vSAN running in your infrastructure.

Resolution

Use the following workaround only as part of data recovery from multiple device failures.
If you believe a disk has been unmounted in error, you can remount it using these steps:

Open an SSH session to the ESXi host. For more information, see Using ESXi Shell in ESXi 5.x and 6.x (2004746).
Determine the unmounted disk(s) using this command:
# vdq -q

You see an output similar to:
{
"Name" : "mpx.vmhba1:C0:T1:L0",
"VSANUUID" : "52811b1b-8bd3-7216-f38e-006c70088e48",
"State" : "Ineligible for use by VSAN",
"Reason" : "Not mounted on this host",
"IsSSD" : "0",
"IsCapacityFlash": "0",
"IsPDL" : "0",
},

Look for the "Not mounted on this host" string to identify the unmounted disk(s).
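
When a host has many disks, the search for the "Not mounted on this host" string can be scripted. The sketch below assumes the output has been captured as text and that every record looks like the sample above. Because the sample contains trailing commas, it is parsed with regular expressions rather than a JSON parser. The function names parse_vdq and unmounted_disks are illustrative.

```python
import re

# Each record is a brace-delimited block of "key" : "value" pairs.
RECORD_RE = re.compile(r"\{(.*?)\}", re.DOTALL)
FIELD_RE = re.compile(r'"([^"]+)"\s*:\s*"([^"]*)"')


def parse_vdq(text):
    """Turn captured `vdq -q` output into a list of dicts, one per disk."""
    return [dict(FIELD_RE.findall(body)) for body in RECORD_RE.findall(text)]


def unmounted_disks(text):
    """Return the names of disks whose Reason is 'Not mounted on this host'."""
    return [
        record["Name"]
        for record in parse_vdq(text)
        if "Name" in record and record.get("Reason") == "Not mounted on this host"
    ]


# The sample record from this article.
SAMPLE = '''
{
"Name" : "mpx.vmhba1:C0:T1:L0",
"VSANUUID" : "52811b1b-8bd3-7216-f38e-006c70088e48",
"State" : "Ineligible for use by VSAN",
"Reason" : "Not mounted on this host",
"IsSSD" : "0",
"IsCapacityFlash": "0",
"IsPDL" : "0",
},
'''

if __name__ == "__main__":
    print(unmounted_disks(SAMPLE))  # ['mpx.vmhba1:C0:T1:L0']
```

The VSANUUID field of each matching record is available from parse_vdq as well, should the UUID be needed instead of the device name.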
Remount the disk or disk group using ESXCLI:
1. Remount a single capacity-tier drive:
  # esxcli vsan storage diskgroup mount -d <identifier>
  
  For example:
  # esxcli vsan storage diskgroup mount -d mpx.vmhba1:C0:T1:L0
2. Remount an entire disk group:
  # esxcli vsan storage diskgroup mount -s <identifier>

Notes:

- Verify the status of the physical disks using host diagnostic tools or SMART data before remounting disks or disk groups.
- To mount a disk, or a disk group by its fronting SSD, using the vSAN UUID (the short and long option forms are equivalent):

# esxcli vsan storage diskgroup mount -u xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx
# esxcli vsan storage diskgroup mount --uuid=xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx
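
The three forms of the mount command shown in this article (-d, -s and -u) can be assembled by a small helper so that the flag always matches the intent. This is a hedged sketch: build_mount_command, remount, and the kind labels are hypothetical names, and the helper only prints the command unless dry_run is disabled, since a remount should follow the disk health checks noted above.

```python
import subprocess

# Flags as used in this article; the kind labels on the left are illustrative.
FLAGS = {
    "capacity_disk": "-d",  # remount a single capacity-tier drive
    "disk_group": "-s",     # remount an entire disk group
    "uuid": "-u",           # mount by vSAN UUID
}


def build_mount_command(kind, identifier):
    """Return the esxcli argument list for the requested kind of remount."""
    if kind not in FLAGS:
        raise ValueError(f"unknown kind {kind!r}; expected one of {sorted(FLAGS)}")
    if not identifier or identifier.startswith("-"):
        raise ValueError("identifier must be a device name or UUID")
    return ["esxcli", "vsan", "storage", "diskgroup", "mount", FLAGS[kind], identifier]


def remount(kind, identifier, dry_run=True):
    """Print the command (dry run) or execute it on the ESXi host."""
    command = build_mount_command(kind, identifier)
    if dry_run:
        print(" ".join(command))
    else:
        # Raises CalledProcessError if esxcli exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    remount("capacity_disk", "mpx.vmhba1:C0:T1:L0")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with device names that contain colons.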