Replacing a faulty NSX-T manager node in a VCF environment

Article ID: 314670


Products

VMware Cloud Foundation
VMware NSX Networking

Issue/Introduction

This article provides procedures for replacing a single node from a backup in a 3-node NSX-T Manager cluster within a VCF environment.


Symptoms:

The NSX cluster's health will be in a degraded state when there is a faulty NSX Manager node, which can block several VCF operations.


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 2.5.x
VMware Cloud Foundation 3.8.x
VMware Cloud Foundation 3.9.x
VMware Cloud Foundation 4.x

Resolution

  1. Basic Info

  • Tested on VCF versions 3.9.1 and 4.0
  • Applies to VCF 4.x
  2. Procedure to replace one NSX-T manager node in a 3-node cluster.

In this procedure, the 3-node NSX-T manager cluster has a single node down. In this example, there are three NSX-T manager nodes in the MGMT cluster, plus the cluster VIP:

  • 172.17.110.22 vi1nsxmanager.dellrack1.vmware.corp (cluster VIP)
  • 172.17.110.23 vi1nsxmanager1.dellrack1.vmware.corp
  • 172.17.110.24 vi1nsxmanager2.dellrack1.vmware.corp
  • 172.17.110.25 vi1nsxmanager3.dellrack1.vmware.corp
Assume that vi1nsxmanager2 has gone down and must be replaced. Before starting, gather the information listed in Section 2.1 below.

2.1 Prerequisites

Gather the following information about the unrecoverable NSX-T manager node as prerequisites:
  • VM Name
  • FQDN hostname
  • IP address, netmask & gateway
  • DNS, NTP servers
  • admin, audit, root user passwords

If you do not know the admin and root passwords, follow the instructions in the password chapter of the VMware Cloud Foundation Operations and Administration guide to retrieve them from the SDDC Manager inventory. If you want to change these two passwords, you should do so after restoring the NSX-T manager VMs, using the SDDC Manager password update function.
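For example, on VCF 4.x the stored credentials can be listed from the SDDC Manager appliance itself. This is a hedged sketch: the lookup_passwords utility is the documented mechanism on VCF 4.x, but its location and prompts vary by version, so treat the exact invocation below as an assumption and prefer the guide referenced above.

# On the SDDC Manager VM (SSH in as the vcf user, then switch to root):
su
# Prompts for SDDC Manager credentials, then prints the accounts and
# passwords held in the SDDC Manager inventory, including the NSX-T
# manager admin/audit/root users (assumed path on VCF 4.x):
/usr/bin/lookup_passwords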


For the NSX-T manager cluster

  • NSX-T manager VM size. When you deploy an NSX-T Manager OVA, you must specify the node size. To determine the size, log in to the management domain vCenter, navigate to the summary page for one of the NSX-T manager VMs for this NSX-T instance, and expand the VM hardware pane. Compare the memory size and number of CPUs of that VM to the NSX Manager VM Resource Requirements section in the following product documents, and from the table, determine the size to use in step 2.2.4:
  • NSX-T 3.0: Please refer to NSX Manager VM and Host Transport Node System Requirements.
  • NSX-T 2.5: Please refer to NSX Manager VM and Host Transport Node System Requirements.

For one operational NSX-T Manager:

  • FQDN hostname and IP address

Note: Make sure you download the OVA image before you start, and verify the md5sum of the OVA file as well. Section 2.2.4 describes how to determine the specific OVA to download.
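For example, assuming the OVA file name used later in this article:

md5sum nsx-unified-appliance-2.4.2.1.0.14374085.ova
# The printed hash must match the MD5 checksum published next to the
# OVA on the download page before you deploy the file.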

2.2 Procedure

2.2.1. Power off faulty NSX-T manager VM

  • From the vSphere Web Client > Hosts and Clusters, select the vi1nsxmanager2 VM
  • Select Power Off

2.2.2. Delete faulty NSX-T manager VM

  • From vSphere Web Client > Hosts and Clusters, select the vi1nsxmanager2 VM
  • Select Delete from disk
  • If you want to be safe, you can rename the VM instead, but remember to delete it once the procedure is complete

2.2.3. Delete faulty NSX-T manager node from NSX-T manager cluster

  • From NSX-T manager UI > System > Overview, select 172.17.110.24, click the wheel icon, click Delete.
  • If the faulty NSX-T manager is the primary node, the wheel icon will not be available; delete it using the CLI as described below.

Note: The steps to delete the manager differ depending on whether the node is primary or secondary

2.2.3.1. Delete a secondary node

  • The NSX-T manager UI will report an error: compute manager could not find....
  • Click the wheel icon; a new Force Delete option is now available. Click Force Delete

2.2.3.2. Delete a primary node

  • First, obtain the UUID of the faulty NSX-T manager. SSH into one operational NSX-T Manager:

ssh admin@vi1nsxmanager3

  • Issue the get cluster status command and record the UUID of the faulty NSX-T manager:

get cluster status

  • Issue the detach command:

detach node <uuid>

For example: detach node 77a01dab-58f1-4a86-8134-0bfc3e0c40d9

2.2.3.3. Wait for deletion to complete

  • From NSX-T manager UI > Home > Dashboard > System, wait until the 2-node NSX-T manager cluster has a green, stable status

2.2.3.4. Verification

After removing one node from the cluster and before adding a new one, run the get cluster status command and verify that the services are UP on the remaining nodes.
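For example, from either remaining manager node (the exact set of service groups in the output varies by NSX-T version):

get cluster status
# Each service group listed (for example MANAGER, CONTROLLER, POLICY,
# HTTPS) should report UP on both remaining nodes before you add the
# replacement.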

2.2.4. Add new NSX-T manager node

The new node cannot be added from the NSX-T manager UI. The NSX-T manager VMs are in the MGMT cluster, but the NSX-T manager cluster only knows about the VI WLD vCenter, not the MGMT domain vCenter, so the add nodes wizard will not allow adding a node to the MGMT cluster.

  • From NSX-T manager UI > System > Overview, record the NSX version and build number of the NSX-T manager nodes

For example: 2.4.2.1.0.14374085

  • Download the exact build from customerconnect.vmware.com, e.g., nsx-unified-appliance-2.4.2.1.0.14374085.ova
  • From vSphere Web Client > Hosts and Clusters > SDDC-Cluster1 > Mgmt-ResourcePool, select Deploy OVF template...
  • Select Local file and choose the OVA file nsx-unified-appliance-2.4.2.1.0.14374085.ova, click Next

  • Input the VM Name, vi1nsxmanager2
  • Select the SDDC-Datacenter, click Next
  • Select the SDDC-Cluster1, select the Mgmt-ResourcePool, click Next
  • Choose the correct size as determined in Section 2.1 above, click Next
  • Select the vSAN datastore (for example, sfo01-m01-vsan), click Next
  • Select the MGMT portgroup (for example, SDDC-DPortGroup-Mgmt), click Next
  • Input the information for the faulty NSX-T manager node

For example:
fqdn=vi1nsxmanager2.dellrack1.vmware.corp
role=nsx-manager-nsx-controller
gateway=172.17.110.1
ipv4=172.17.110.24
netmask=255.255.255.0
dns=172.17.110.251
domain=dellrack1.vmware.corp
ntp=172.17.110.251
ssh=enabled (checked)
allowroot=disabled (unchecked)

  • Review and click Finish
  • Wait for task completion
  • Power On the VM
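
Alternatively, the deployment can be scripted with VMware ovftool instead of the vSphere Web Client wizard. This is a hedged sketch, not the procedure this article was validated with: the nsx_* property names are assumptions based on the NSX-T unified appliance OVA, so first run ovftool against the OVA (which prints the properties and deployment options it accepts) and substitute your own vCenter inventory path:

# Inspect the OVA's accepted properties and deployment options first:
ovftool nsx-unified-appliance-2.4.2.1.0.14374085.ova

# Hedged deployment sketch; password properties (prompted for by the UI
# wizard) are omitted here and will be listed by the inspection above:
ovftool --name=vi1nsxmanager2 --deploymentOption=medium \
 --datastore=sfo01-m01-vsan --network=SDDC-DPortGroup-Mgmt \
 --prop:nsx_hostname=vi1nsxmanager2.dellrack1.vmware.corp \
 --prop:nsx_ip_0=172.17.110.24 --prop:nsx_netmask_0=255.255.255.0 \
 --prop:nsx_gateway_0=172.17.110.1 --prop:nsx_dns1_0=172.17.110.251 \
 --prop:nsx_domain_0=dellrack1.vmware.corp --prop:nsx_ntp_0=172.17.110.251 \
 --prop:nsx_isSSHEnabled=True --prop:nsx_allowSSHRootLogin=False \
 --powerOn nsx-unified-appliance-2.4.2.1.0.14374085.ova \
 'vi://administrator@vsphere.local@<mgmt vcenter fqdn>/SDDC-Datacenter/host/SDDC-Cluster1/Resources/Mgmt-ResourcePool'

Use the size determined in Section 2.1 for --deploymentOption (medium is only an example).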

2.2.5. Join the new NSX-T manager node to the cluster

  • Look up the IP address of the operational NSX-T manager node

nslookup vi1nsxmanager1

  • You can also ping vi1nsxmanager1, which displays the IP address
  • SSH into an operational NSX-T manager node

ssh admin@vi1nsxmanager1

  • Get the cluster ID

get cluster config | find Id:

  • Get the API thumbprint

get certificate api thumbprint

  • SSH into the new NSX-T manager node

ssh admin@vi1nsxmanager2

  • Join the new node to the cluster

join <vi1nsxmanager1 ip> cluster-id <uuid> thumbprint <thumbprint> username admin

For example:
join 172.17.110.23 cluster-id 3ca96913-7d42-4cce-a69c-365a7c52b545 thumbprint dd35cf8826bcb9cd6bd21deddb81a7447cc726fcfa393d71781d492a3302ca1e username admin

  • Wait about 5 minutes for the join to complete

2.2.6. Wait for addition complete

  • From NSX-T manager UI > Home > Dashboard > System, wait until 3-node NSX-T manager cluster has a green stable status.

2.2.7. Reassign old VMCA or external CA-signed certificate to the new NSX-T manager node.

2.2.7.1 If the certificate exists, follow these steps:
  • From NSX-T manager UI > System > Certificates > Certificates, check the VMCA/external CA-signed certificate for the faulty NSX-T manager node.

For example:
Issued By=CA
Issued To=vi1nsxmanager2.dellrack1.vmware.corp

  • Check only the checkbox; do not open the detail screen
  • Hover the mouse over the ID field (the second column); a pop-up appears with the ID of the certificate
  • Record the certificate ID from that pop-up
  • The next step sends a POST request to the new NSX-T manager node. You can use either Postman on Windows or curl on Linux; the example here uses curl on Linux
  • Issue the POST request to the new NSX-T manager node to assign the old certificate to the new node

curl -H 'Accept: application/json' -H 'Content-Type: application/json' \
--insecure -u 'admin:<admin password>' -X POST \
'https://<new nsx-t mgr fqdn or ip>/api/v1/node/services/http?action=apply_certificate&certificate_id=<certificate id>'

For example:
curl -H 'Accept: application/json' -H 'Content-Type: application/json' \
--insecure -u 'admin:VMware123!VMware123!' -X POST \
'https://vi1nsxmanager2.dellrack1.vmware.corp/api/v1/node/services/http?action=apply_certificate&certificate_id=24781ed5-7721-49bb-801d-cc8a4415d60e'

  • The curl command does not return a response, because the HTTP server restarts as part of the command.
  • From the vSphere Web Client > Hosts and Clusters, select the new NSX-T manager VM and click Restart Guest OS.
  • Wait 5 minutes for the node to reboot.

Specific to VCF 4.0: If assigning the certificate fails because the certificate revocation list (CRL) could not be verified, please follow the steps in KB 78794 to address the problem. If you decide to disable CRL checking in order to assign the certificate, re-enable CRL checking once the certificate has been assigned.
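
If you toggle CRL checking through the API rather than the UI, a hedged sketch follows (the SecurityGlobalConfig resource and its crl_checking_enabled field are based on the NSX-T 3.x API; KB 78794 remains the authoritative procedure):

# Read the current security global config and note its "_revision" value:
curl --insecure -u 'admin:<admin password>' \
'https://<nsx-t mgr fqdn>/api/v1/global-configs/SecurityGlobalConfig'

# PUT the same JSON payload back with "crl_checking_enabled": false and
# the "_revision" you just read; repeat with true to re-enable CRL
# checking once the certificate has been assigned.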

2.2.7.2 If the certificate does not exist, see Replace Expired or Self-signed NSX-T Manager Certificates with VMCA-Signed Certificates for more information.

2.2.8. Wait for reboot to complete

  • From NSX-T manager UI > Home > Dashboard > System, wait until 3-node NSX-T manager cluster has a green stable status

2.2.9. Verify VMCA cert of the new NSX-T manager node

  • This step requires openssl
  • Issue the openssl command to retrieve the certificate of the new NSX-T manager node

For example:
echo | openssl s_client -no_ign_eof -showcerts -connect \
vi1nsxmanager2.dellrack1.vmware.corp:443 > nsx2.pem

  • Issue the openssl command to display the certificate

For example:
openssl x509 -in nsx2.pem -noout -text | more

  • Verify the Issuer is the VMCA (psc-1 in this example)
  • Verify the Subject CN=vi1nsxmanager2.dellrack1.vmware.corp
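
The two openssl commands can also be combined into a single check that prints only the fields being verified:

echo | openssl s_client -connect vi1nsxmanager2.dellrack1.vmware.corp:443 2>/dev/null \
| openssl x509 -noout -issuer -subject
# The issuer line should name the VMCA (psc-1 in this example) and the
# subject CN should be vi1nsxmanager2.dellrack1.vmware.corp.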

2.2.10. Refresh SDDC Manager SSH Key Store for VCF 4.0

This step is specific to VCF 4.0. As a final step, you need to update the SSH keys that SDDC Manager stores for the NSX-T managers. VMware offers a script that automates this process; follow the Refresh SDDC Manager SSH Keys procedure documented in KB 79004.

2.2.11. Update VM Anti-Affinity Rule

When Cloud Foundation deploys NSX-T Manager, it creates a VM anti-affinity rule to prevent the VMs of the same NSX-T Manager cluster from running on the same host. In this step, you add the newly deployed replacement VM to the rule for this NSX-T Manager cluster.

  • Log in to the management domain vCenter Server and select Menu > Hosts and Clusters
  • In the Navigator pane, select the management cluster
  • Select Configure > VM/Host Rules
  • Add the VM to the correct "separate virtual machines" rule. The rule for the management-domain NSX-T Manager cluster is named anti-affinity-rule-nsxt, while the rule for workload domains has the form "<NSXT Mgr VIP FQDN> - NSX-T Managers Anti Affinity Rule". Once you locate the rule, click Edit and add the newly deployed VM (e.g., vi1nsxmanager2) to it.
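
For environments managed from the command line, the rule membership can also be updated with govc. This is a hedged sketch: the cluster.rule.change subcommand and its handling of the VM list are assumptions, so verify the syntax with govc cluster.rule.change -h before running, and note that the full member list (not just the new VM) is passed:

# List the rules defined on the management cluster:
govc cluster.rule.ls -cluster SDDC-Cluster1 -l

# Hedged sketch: set the rule's membership to all three manager VMs,
# including the replacement node:
govc cluster.rule.change -cluster SDDC-Cluster1 -name anti-affinity-rule-nsxt \
 vi1nsxmanager1 vi1nsxmanager2 vi1nsxmanager3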