vSAN Health Service - Network Health - Active Multicast connectivity check

Article ID: 326792

Products

VMware vSAN

Issue/Introduction

Q: What does the Network Health - Active Multicast connectivity check do?

If the Network Health - Multicast assessment based on other checks fails, multicast is likely the problem. In that case, the Active Multicast connectivity check is performed. Otherwise, this check is skipped.

This health check captures multicast packets on all ESXi hosts in the cluster for a period of time. It specifically looks for what is known as the CMMDS Master Heartbeat. Each ESXi host elected as a vSAN/CMMDS Master (one per partition) sends this heartbeat once every second. These heartbeats are sent over multicast, and all ESXi hosts in the cluster must receive them for the cluster to function properly. Therefore, if one ESXi host sends a heartbeat and another ESXi host does not receive it, this indicates a multicast misconfiguration, usually in the physical network.

This health check uses the packet captures from all the ESXi hosts, and checks which heartbeats were heard by which ESXi hosts, and which ESXi hosts did not hear a certain heartbeat. The health check then attempts to describe the situation it encountered.
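
The comparison described above can be pictured with a minimal Python sketch. This is not the health check's actual implementation. The function name heartbeat_report, the (sender, sequence) tuples, and the shape of the captures mapping are all assumptions made for illustration.

```python
def heartbeat_report(hosts, sent, captures):
    """For each heartbeat, list which hosts heard it and which missed it.

    hosts    -- all ESXi host names in the cluster
    sent     -- iterable of (sender, sequence) heartbeats (hypothetical model)
    captures -- dict: receiver host -> set of (sender, sequence) it captured
    """
    report = {}
    for heartbeat in sent:
        sender = heartbeat[0]
        # The sender itself is not expected to "receive" its own heartbeat.
        others = [h for h in hosts if h != sender]
        heard = sorted(h for h in others if heartbeat in captures.get(h, set()))
        missed = sorted(h for h in others if heartbeat not in captures.get(h, set()))
        report[heartbeat] = {"heard_by": heard, "missed_by": missed}
    return report


hosts = ["esx1", "esx2", "esx3"]
sent = [("esx1", 1), ("esx1", 2)]
captures = {
    "esx2": {("esx1", 1), ("esx1", 2)},
    "esx3": {("esx1", 1)},  # esx3 missed the second heartbeat
}
report = heartbeat_report(hosts, sent, captures)
for heartbeat, result in report.items():
    print(heartbeat, result)
```

A host that appears in missed_by for every heartbeat of a given master is a candidate for being cut off from that master's multicast traffic.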

Symptoms:
This article explains the Network Health - Active Multicast connectivity check in the vSAN Health Service in vSAN (formerly known as Virtual SAN) and provides details on why it may report an error.

Environment

VMware vSAN 6.5.x
VMware vSAN 6.0.x

Cause

The test relies entirely on the multicast infrastructure configured on the upstream physical switch. A failure therefore indicates that multicast is not working properly in the physical network, and the customer's networking team must be involved.

Resolution

Q: What does it mean when it is in an error state?

The common cases are:
  1. Multicast is not working at all. In this case, the check identifies every ESXi host as an individual group that cannot communicate with any other. This usually means the physical switch to which the ESXi hosts are attached has multicast disabled.
  2. There is a clear split in the network. ESXi hosts within a group can communicate with each other, but the groups cannot communicate with one another. This usually results from a network topology where the first group is attached to one multicast-enabled switch and the second group is attached to another multicast-enabled switch, but the two switches are not configured to allow multicast to flow between them.
  3. Multicast may be partially working, but no clear groups are forming. For example, host A can receive the heartbeat of host B, but host B cannot receive the heartbeat of host A. This should never happen and indicates a bug in the physical switch. VMware strongly suggests checking the multicast group address with the command esxcli vsan network list. The output shows the multicast group in use. The same multicast group must be populated in the IGMP/PIM multicast groups, showing the respective MAC address of each vSAN member.
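
To make the three cases concrete, here is a rough Python sketch that classifies a set of hosts from a "who hears whom" relation. It is only an illustration of the reasoning, not how the health service is implemented. The function name classify, the result labels, and the hears mapping are hypothetical.

```python
def classify(hosts, hears):
    """Classify multicast connectivity from a reception relation.

    hosts -- iterable of host names
    hears -- dict: host -> set of hosts whose heartbeats it receives
             (hypothetical input; missing keys mean "hears nobody")
    Returns (label, groups).
    """
    hosts = sorted(set(hosts))
    known = set(hosts)
    peers = {h: {p for p in hears.get(h, set()) if p in known and p != h}
             for h in hosts}

    # Case 3: A hears B but B does not hear A.
    for a in hosts:
        for b in peers[a]:
            if a not in peers[b]:
                return "no_clear_groups", []

    # Build groups as connected components of the (now symmetric) relation.
    groups, seen = [], set()
    for start in hosts:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(peers[current] - component)
        seen |= component
        groups.append(sorted(component))

    # A clear group requires every member to hear every other member.
    for group in groups:
        for member in group:
            if any(other not in peers[member] for other in group if other != member):
                return "no_clear_groups", []

    if len(groups) == 1:
        return "healthy", groups
    if all(len(group) == 1 for group in groups):
        return "multicast_not_working", groups  # case 1
    return "split", groups                      # case 2


print(classify("ABCD", {}))
print(classify("ABCD", {"A": {"B"}, "B": {"A"}, "C": {"D"}, "D": {"C"}}))
print(classify("AB", {"A": {"B"}}))
```

The first call models case 1 (every host isolated), the second models case 2 (two groups on separate switches), and the third models the asymmetric reception of case 3.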

Q: How does one troubleshoot and fix the error state?

VMware recommends engaging the network administrator if the Network Health - Active Multicast connectivity check fails. Use the output of the health check when working with the network administrator. The detailed breakdown of which case applies and exactly which ESXi hosts fall into which group helps the network administrator determine where the issue may reside.
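
When handing information to the network administrator, it may help to extract the multicast group addresses from the output of esxcli vsan network list and compare them with the groups the switch reports. The Python sketch below assumes the output consists of "Key: Value" lines and that the relevant keys contain the words "Multicast Address". The sample text is illustrative and should be checked against real output from the hosts. The switch_groups list is a hypothetical input supplied by the network team.

```python
def multicast_groups(esxcli_output):
    """Return the sorted multicast addresses found in 'Key: Value' lines."""
    groups = set()
    for line in esxcli_output.splitlines():
        # Split on the first colon only, so IPv6 values stay intact.
        key, separator, value = line.strip().partition(":")
        if separator and "Multicast Address" in key:
            groups.add(value.strip())
    return sorted(groups)


def missing_on_switch(host_groups, switch_groups):
    """Groups used by vSAN that do not appear in the switch's group list."""
    return sorted(set(host_groups) - set(switch_groups))


# Illustrative sample only; field names are an assumption about the format.
SAMPLE_OUTPUT = """\
Interface
   VmkNic Name: vmk2
   Agent Group Multicast Address: 224.2.3.4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group Multicast Port: 12345
"""

host_groups = multicast_groups(SAMPLE_OUTPUT)
print(host_groups)
print(missing_on_switch(host_groups, ["224.1.2.3"]))
```

Any address returned by missing_on_switch is a group that vSAN uses but that the switch, according to the list provided, has not learned.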


Workaround:
Due to the nature of this alarm, no workaround is available in the VMware virtual layer, because the test relies on multicast. The customer's networking team must be engaged to troubleshoot the multicast groups.

Additional Information

Impact/Risks:
The condition behind this alarm can cause a network partition in the vSAN cluster. A partition can make vSAN objects inaccessible and lead to data unavailability that may affect VMs.