Using a built-in tool to perform a simultaneous shutdown/reboot of all hosts in the vSAN cluster

Article ID: 322144

Products

VMware vSAN

Issue/Introduction

This article describes how to safely perform a simultaneous shutdown or reboot of all hosts in a vSAN cluster. With ESXi 6.7 U3 and later, a built-in tool can be used to avoid the issue described in the Knowledge Base article "A simultaneous reboot of all hosts in the vSAN cluster may result in data unavailability after a single failure".

Note: This solution applies only to a healthy vSAN cluster in which all hosts are running 6.7 U3 or later.


Environment

VMware vSAN 6.7.x
VMware vSAN 7.0.x

Cause

Resolution

Before starting a cluster-level shutdown or reboot of the hosts, follow the steps below:
1) Check vSAN Health to confirm that the cluster is healthy.
2) Disable HA to prevent the cluster from registering hosts as failed.
3) Power off all VMs running in the vSAN cluster. If vCenter Server is hosted in the vSAN cluster, do not power off the vCenter Server VM at this point.
4) For vSphere 7.0 U1 and later, enable vCLS retreat mode. See vSphere Cluster Services (vCLS) in vSphere 7.0 U1.

Note: Ensure that all vCLS VMs have been completely removed from the cluster before moving on to the next step. This can be checked by selecting the vSAN Cluster > VMs Tab.
   
5) If vCenter Server is hosted in the vSAN cluster, power off the vCenter Server VM. The vSphere Client becomes unavailable once it is powered off.
6) Run the command 'esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates' on all hosts in the cluster.
7) Log in to one of the hosts in the cluster (not the witness host).
8) Run the command below on only one host in the cluster. Running it on multiple hosts concurrently may cause a race condition with unexpected results.

    python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare

9) Wait until the command returns and prints 'Cluster preparation is done'.


Notes:
If an error occurs at this step, resolve the issue based on the error message and run the prepare command (step 8) again.
If there are unhealthy or disconnected hosts in the cluster, recover or remove them and retry.
The cluster will be fully partitioned after this step. This is expected behavior.

10) Put all hosts into maintenance mode with the 'No Action' option.
11) Proceed with the reboot/shutdown.
12) When all hosts are back up after the reboot/shutdown, take all hosts out of maintenance mode.

Note: If any hosts fail to come up, manually recover them or move them out of the vSAN cluster.

13) Log in to one of the hosts in the cluster (not the witness host).
14) Run the command below on only one host in the cluster. Running it on multiple hosts concurrently may cause a race condition with unexpected results.

    python /usr/lib/vmware/vsan/bin/reboot_helper.py recover

15) Wait until the command returns and prints 'Cluster reboot/power-on is completed successfully!'
16) Run the command 'esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates' on all hosts in the cluster.
17) For vSphere 7.0 U1 and later, disable vCLS retreat mode. If vCenter Server is hosted in the vSAN cluster, first wait for the vCenter Server VM to be powered on and running. See vSphere Cluster Services (vCLS) in vSphere 7.0 Update 1.
18) Re-enable HA.    
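
Steps 6 and 16 run the same command on every host with a different value. The following is a minimal sketch of how that could be scripted from an administration workstation. It assumes SSH is enabled on each host and that the root user may log in; the host names and the helper functions are hypothetical, and only the esxcfg-advcfg invocation is taken from this article. By default the sketch only prints the commands so they can be reviewed first.

```python
import subprocess

# Advanced option named in steps 6 and 16 of this article.
ADV_OPTION = "/VSAN/IgnoreClusterMemberListUpdates"


def build_advcfg_command(value):
    """Return the ESXi command from steps 6 and 16 for the given value (0 or 1)."""
    if value not in (0, 1):
        raise ValueError("value must be 0 or 1")
    return f"esxcfg-advcfg -s {value} {ADV_OPTION}"


def build_ssh_command(host, value, user="root"):
    """Return an argv list that runs the command on one host over SSH.

    Assumption: SSH is enabled on the host and `user` is allowed to log in.
    """
    return ["ssh", f"{user}@{host}", build_advcfg_command(value)]


def set_on_all_hosts(hosts, value, dry_run=True):
    """Build the command for every host; execute them only if dry_run is False."""
    commands = [build_ssh_command(host, value) for host in hosts]
    if not dry_run:
        for argv in commands:
            # check=True raises at the first host where the command fails.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical host names; replace with the hosts of the cluster.
    for argv in set_on_all_hosts(["esx01.example.com", "esx02.example.com"], 1):
        print(" ".join(argv))
```

Calling set_on_all_hosts(hosts, 1) corresponds to step 6 and set_on_all_hosts(hosts, 0) to step 16.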

Notes:
If there are unhealthy or disconnected hosts in the cluster, recover them or remove them from the vSAN cluster. Retry the above commands only after vSAN Health shows all available hosts in a green state.
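
The completion messages in steps 9 and 15 can also be checked by a script before moving on or retrying. The sketch below assumes the output of reboot_helper.py has been captured as text; the function name is hypothetical, and the check is a plain substring match because the rest of the tool's output format is not described in this article.

```python
# Completion messages quoted in steps 9 and 15 of this article.
PREPARE_DONE = "Cluster preparation is done"
RECOVER_DONE = "Cluster reboot/power-on is completed successfully!"


def step_completed(output, phase):
    """Return True if the captured output contains the completion message.

    `phase` is "prepare" (step 9) or "recover" (step 15).
    """
    expected = {"prepare": PREPARE_DONE, "recover": RECOVER_DONE}
    if phase not in expected:
        raise ValueError("phase must be 'prepare' or 'recover'")
    return expected[phase] in output
```

If step_completed returns False, treat the step as not finished and review the full output before retrying.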

In a 3-node vSAN cluster, the reboot_helper.py recover command will not work if one host has failed.
In this case, the administrator should do the following:
1. Temporarily remove the failed host's information from the unicastagent list.
2. Add the host back after running reboot_helper.py recover.

The commands below remove a host from and add a host to the vSAN cluster. Do not run them if you are not familiar with them; instead, open a ticket with the VMware GSS vSAN team for help.
esxcli vsan cluster unicastagent remove -a <IP Address> -t node -u <NodeUuid>
esxcli vsan cluster unicastagent add -t node -u <NodeUuid> -U true -a <IP Address> -p 12321
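
Because the remove and add commands take the same IP address and node UUID in a different argument order, it can help to generate both from one pair of values. The sketch below only builds the command strings shown above and does not execute anything; the function names and the sample values are hypothetical, and the IP address is validated with Python's standard ipaddress module.

```python
import ipaddress


def build_unicastagent_remove(ip, node_uuid):
    """Return the unicastagent remove command for one host entry."""
    ipaddress.ip_address(ip)  # raises ValueError if the address is malformed
    return f"esxcli vsan cluster unicastagent remove -a {ip} -t node -u {node_uuid}"


def build_unicastagent_add(ip, node_uuid, port=12321):
    """Return the unicastagent add command for one host entry.

    The default port 12321 is the value used in this article.
    """
    ipaddress.ip_address(ip)  # raises ValueError if the address is malformed
    return (
        f"esxcli vsan cluster unicastagent add -t node -u {node_uuid} "
        f"-U true -a {ip} -p {port}"
    )


if __name__ == "__main__":
    # Hypothetical values for the failed host; take the real ones from the cluster.
    failed_ip = "192.0.2.13"
    failed_uuid = "00000000-0000-0000-0000-000000000000"
    print(build_unicastagent_remove(failed_ip, failed_uuid))
    print(build_unicastagent_add(failed_ip, failed_uuid))
```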

If IPv6 is not enabled on ESXi, the following error may be logged when running reboot_helper.py.
This is normal behavior, and the error has no impact.
 ERROR:root:Error to run _getIPRouteListFromEsxCLI
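
When the output of reboot_helper.py is reviewed by a script, this known message can be filtered out so that only other errors stand out. The sketch below is hypothetical: it assumes that other errors are also logged on lines beginning with "ERROR:", which this article does not confirm.

```python
# Error message that this article describes as having no impact.
BENIGN_ERRORS = ("ERROR:root:Error to run _getIPRouteListFromEsxCLI",)


def unexpected_errors(output):
    """Return the ERROR lines in `output` that are not known to be benign.

    Assumption: errors appear on lines that start with "ERROR:".
    """
    result = []
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("ERROR:") and stripped not in BENIGN_ERRORS:
            result.append(stripped)
    return result
```

An empty list means that no error lines other than the benign IPv6 message were found in the captured output.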