ESXi host fails to rejoin VMware vSAN cluster following reboot (2148122)

Symptoms

  • In certain circumstances, a vSAN host may be unable to rejoin the vSAN cluster following a reboot. After logging in to the ESXi host shell to check cluster membership, the host will report that clustering is not enabled. You see an output similar to:

    # esxcli vsan cluster get
    Virtual SAN Clustering is not enabled on this host


  • The system boot log (/var/log/boot.gz) reveals that the vSAN state machine shuts down shortly after initializing:

    2016-12-05T17:11:55.932Z cpu2:36230 opID=d848dca4)CMMDS: CMMDSLogStateTransition:1217: Transitioning(56eb3316-74f3-b796-47ac-782bcb516e1f) from Invalid to Discovery: (Reason: State machine initialization)
    <additional messaging omitted>
    2016-12-05T17:12:01.838Z cpu2:33403)CMMDS: CMMDSLogStateTransition:1217: Transitioning(56eb3316-74f3-b796-47ac-782bcb516e1f) from Agent to Invalid: (Reason: State machine deinitialization)


  • Attempting to rejoin the cluster manually results in a vsantraced failure. You see an output similar to:

    # esxcli vsan cluster join -u 52924056-5029-3645-ad7f-ea237d36f577
    Failed to join the host in VSAN cluster (Failed to start vsantraced (return code 2))
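
The boot log check described above can be sketched as a small shell helper. This is only an illustration: the function name show_cmmds_transitions is hypothetical, and it assumes that gzip is available in the shell and that the log is gzip-compressed, as the .gz suffix suggests.

```shell
# Print the CMMDS state transitions recorded in a gzip-compressed boot log.
# Argument 1: path to the log (defaults to /var/log/boot.gz, the path named above).
show_cmmds_transitions() {
    log="${1:-/var/log/boot.gz}"
    # gzip -dc writes the decompressed log to stdout without modifying the file
    gzip -dc "$log" | grep 'CMMDSLogStateTransition'
}

# Usage (assumption: run in the ESXi shell of the affected host):
#   show_cmmds_transitions
```

A transition "from Agent to Invalid" with the reason "State machine deinitialization" shortly after the initial "Invalid to Discovery" transition matches the symptom shown above.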

Purpose

After rebooting a VMware vSAN host, the host may fail to rejoin the vSAN cluster.

Cause

This issue occurs if the directory used for vsantraces storage contains more than approximately 2,000 files (the exact number may vary). During initialization, vsantraced attempts to determine the size of the directory; if the file argument list is too long, vsantraced fails to start, and the host in turn fails to rejoin the vSAN cluster.
The files accumulate because large numbers of vSAN Observer performance history files remain in the directory used for vsantraces storage.

Resolution

The buildup of large numbers of vSAN Observer performance history files over time is resolved in ESXi 6.0 Update 3 and in ESXi 6.5.0d (vSAN 6.6), both available at VMware Downloads.

If the host is already failing to rejoin the vSAN cluster, use this procedure to verify the behavior and work around the issue by removing old files from the vsantraces storage directory.

Note: If all old vSAN Observer files are removed, there will be no historical performance data for this host until vSAN Observer creates new performance data over time. This does not result in any loss of performance data in the vSAN Performance Service or vCenter Server.

  1. Determine the current vsantraces storage location by examining the configuration file:
    # cat /etc/vmware/vsan/vsantraced.conf |grep ^VSANTRACED_VOLUME
    For example:
    # cat /etc/vmware/vsan/vsantraced.conf |grep ^VSANTRACED_VOLUME
    VSANTRACED_VOLUME="/vmfs/volumes/Datastore1/scratch/vsantraces"

  2. Move to the directory using this command:
    # cd "<directory>"

    For example:
    # cd "/vmfs/volumes/Datastore1/scratch/vsantraces"

  3. Determine how many files are in the directory using this command:
    # ls |wc -l

    For example:
    # ls |wc -l
    5013


    In this example, the directory has 5,013 files, so roughly 3,000 or more files must be removed to bring the count below 2,000.

  4. Confirm that the files belong to vSAN Observer using this command:
    # ls |grep vsanObserver|wc -l

    For example:
    # ls |grep vsanObserver|wc -l
    4986


    In this example, the directory has 4,986 vSAN Observer files. The oldest of these can be removed.

  5. Move the required number of vSAN Observer files to a different directory, or delete them.
    Notes
    • Wildcard expansion (with a '*' in the argument) may fail if the argument list is too long.
    • Reduce the number of files in the directory to below 2,000.
    • Rerun step 3 to confirm the number of files remaining in the directory.

  6. Rejoin the vSAN cluster by rebooting the host or running this command:
    # esxcli vsan cluster join -u <cluster uuid>

    For example:
    # esxcli vsan cluster join -u 52924056-5029-3645-ad7f-ea237d36f577
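
Steps 1 to 5 can be sketched as two shell functions that avoid wildcard expansion entirely. This is a minimal sketch, not a VMware-supplied tool: the function names, the default of 1,900 files to keep, and the destination directory in the usage comment are all assumptions chosen for illustration. It assumes vSAN Observer files can be recognized by "vsanObserver" in their names (as in step 4), that no file name contains a newline, and that the destination lies outside the vsantraces directory. Files are moved rather than deleted so the change can be reversed.

```shell
# Step 1: print the trace directory configured in a vsantraced.conf-style file.
trace_dir_from_conf() {
    conf="${1:-/etc/vmware/vsan/vsantraced.conf}"
    # Take the value after '=' and drop the surrounding double quotes
    grep '^VSANTRACED_VOLUME=' "$conf" | cut -d= -f2 | tr -d '"'
}

# Steps 3 to 5: move the oldest vSAN Observer files from $1 to $2 until at most
# $3 entries (default 1900, an assumed margin below 2,000) remain in $1.
prune_observer_files() {
    dir="$1"
    dest="$2"
    keep="${3:-1900}"
    total=$(ls "$dir" | wc -l | tr -d ' ')
    excess=$((total - keep))
    # Nothing to do if the directory is already at or below the limit
    [ "$excess" -gt 0 ] || return 0
    mkdir -p "$dest" || return 1
    # ls -tr lists oldest first; names travel through a pipe, not an argument
    # list, so the "argument list too long" failure cannot occur here
    ls -tr "$dir" | grep vsanObserver | head -n "$excess" |
    while IFS= read -r name; do
        mv "$dir/$name" "$dest/"
    done
}

# Usage (hypothetical destination outside the trace directory):
#   dir=$(trace_dir_from_conf)
#   prune_observer_files "$dir" /vmfs/volumes/Datastore1/vsantraces-old
```

If the directory holds fewer vSAN Observer files than the excess, it stays above the limit after the move, so rerunning step 3 afterwards is still worthwhile.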

Additional Information

The cluster UUID for the command in step 6 of the resolution can be obtained from another host in the vSAN cluster:
  1. Log in to the ESXi shell of another host in the vSAN cluster.
  2. Get the cluster information from esxcli:
    # esxcli vsan cluster get |grep "Sub-Cluster UUID"

    This will return the UUID required for the 'esxcli vsan cluster join' command. For example:
    # esxcli vsan cluster get |grep "Sub-Cluster UUID"
        Sub-Cluster UUID: 52924056-5029-3645-ad7f-ea237d36f577
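
To capture only the UUID value rather than the whole line, the output can be filtered further. The helper below is a sketch: the name extract_cluster_uuid is hypothetical, and it assumes the field is printed as "Sub-Cluster UUID: <uuid>" with optional leading whitespace, as in the example above.

```shell
# Read "esxcli vsan cluster get" output on stdin and print only the UUID.
extract_cluster_uuid() {
    # Strip the optional indentation and the "Sub-Cluster UUID:" label;
    # -n with the p flag prints only the line where the label was found
    sed -n 's/^[[:space:]]*Sub-Cluster UUID:[[:space:]]*//p'
}

# Usage on a host that is still a member of the cluster:
#   esxcli vsan cluster get | extract_cluster_uuid
```

The printed value can then be passed to the -u option of 'esxcli vsan cluster join' on the affected host.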

Update History

02/24/2017 - Added 6.0 U3 fix details.

Language Editions

ja,2148961;zh_cn,2151903