How to remove a vRealize Automation appliance from a cluster

Products

VMware Aria Suite

Issue/Introduction

vRealize Automation does not provide a means to remove an appliance node from an existing cluster, which may be required for business reasons or for correcting an issue.

This article provides steps to remove the vRealize Automation appliance from a cluster.

Environment

VMware vRealize Automation 6.2
VMware vRealize Automation 7.x
VMware vRealize Automation 7.3.x
VMware vRealize Automation 7.5.x
VMware vRealize Automation 7.1.x
VMware vRealize Automation 7.4.x
VMware vRealize Automation 6.2.x
VMware vRealize Automation 7.2.x
VMware vRealize Automation 7.0.x
VMware vRealize Automation 7.6.x
VMware vRealize Orchestrator 7.x
VMware vRealize Automation 6.x

Resolution

Note:

This procedure has been tested on vRealize Automation 7.3.
The following steps can significantly impact the health of the vRealize Automation environment. Back up and snapshot your environment first so that the changes can be rolled back if issues are encountered.
It is assumed that the node to be removed is a replica and does not have the primary postgres instance.
If you are using vRealize Automation 7.5, skip steps 2, 3, 5, 9, and 10.
IMPORTANT: If the environment this KB is being executed against has been hotpatched with cumulative updates in 7.4 or 7.5, additional updates in PostgreSQL are required. Perform Step #4 if the environment has hotpatches installed, otherwise skip this step.

To remove the node:

In each tenant, open every directory and verify that its connector does not point to the failing node. Change the connector if necessary.
Connect to the replica node through SSH or the console and extract the rabbitmq node name from the NODENAME variable in the following file:

/etc/rabbitmq/rabbitmq-env.conf

If the node is unavailable, the default name is the following, where node-short-domain-name is the short name taken from the node's FQDN:

rabbit@node-short-domain-name

Alternatively, the rabbitmq node name is also displayed in the vRA Settings > Messaging tab of the VAMI (i.e., the web management interface found on https://<vra_appliance_node_fqdn>:5480) of the other healthy nodes.
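
The lookup in this step can be scripted. The following is a minimal sketch, assuming the file contains a line of the form NODENAME=<value>. The function name rabbit_nodename is a hypothetical helper and is not part of vRealize Automation.

```shell
# Hypothetical helper: print the NODENAME value from a rabbitmq-env.conf file.
# Assumption: the file holds a line of the form NODENAME=<value>.
# Usage: rabbit_nodename [path]   (path defaults to the file named in this step)
rabbit_nodename() {
    conf="${1:-/etc/rabbitmq/rabbitmq-env.conf}"
    # Keep the last NODENAME assignment and strip optional surrounding quotes.
    sed -n 's/^[[:space:]]*NODENAME=//p' "$conf" | tail -n 1 | tr -d "\"'"
}
```

Running rabbit_nodename with no argument on the replica reads the default path. An empty result suggests the variable is not set in that file, in which case the VAMI Messaging tab on a healthy node is the safer source.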

If the database is configured for SYNCHRONOUS replication (automatic failover), switch it to ASYNC on the Database tab before removing the clustered nodes.
Modify hf_execution_cmd and hf_patch_nodes tables to allow for cascading deletes:
1. SSH into the primary appliance
2. su postgres
3. psql -d vcac
  - ALTER TABLE hf_patch_nodes DROP CONSTRAINT hf_patch_nodes_node_Id_fkey;
  - ALTER TABLE hf_patch_nodes ADD CONSTRAINT hf_patch_nodes_node_Id_fkey FOREIGN KEY (node_id) REFERENCES public.cluster_nodes (node_id) ON DELETE CASCADE;
  - ALTER TABLE hf_execution_cmd DROP CONSTRAINT hf_execution_cmd_cmd_id_fkey;
  - ALTER TABLE hf_execution_cmd ADD CONSTRAINT hf_execution_cmd_cmd_id_fkey FOREIGN KEY (cmd_id) REFERENCES public.cluster_commands (cmd_id) ON DELETE CASCADE;
Power down the replica node.
Log in to the Virtual Appliance Management Interface (VAMI) of the primary vRealize Automation appliance.

Example: https://<vra_appliance_node_fqdn>:5480

Navigate to vRA Settings > Cluster tab.
Remove the failing node from the cluster using the 'Delete' button.
In an SSH or console session on the primary vRA node, run the following commands:

Capture the registered node name:
rabbitmqctl cluster_status
Note: The node name may be in FQDN format. Ensure the correct name is used in the next command.
rabbitmqctl forget_cluster_node rabbit@node-short-domain-name

rabbit@node-short-domain-name is the node name extracted earlier from /etc/rabbitmq/rabbitmq-env.conf on the replica.

In an SSH or console session on the primary vRealize Automation appliance, run the commands:

sed -i "/failed-node-fqdn/d" "/etc/haproxy/conf.d/10-psql.cfg" "/etc/haproxy/conf.d/20-vcac.cfg"

service haproxy restart

/usr/sbin/vcac-config cluster-config-ping-nodes --services haproxy

The value failed-node-fqdn will be the FQDN of the replica node being removed.
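
Because sed -i edits the files in place, it can help to preview the affected lines first. The following is a minimal sketch under that assumption. The function name preview_fqdn_lines is a hypothetical helper, and it only reads the files.

```shell
# Hypothetical helper: list the lines that mention the given FQDN without
# changing anything. Usage: preview_fqdn_lines <fqdn> <file>...
preview_fqdn_lines() {
    fqdn="$1"
    shift
    # -F matches the FQDN as a literal string (dots are not wildcards),
    # -H prints the file name, -n prints the line number.
    grep -HnF -- "$fqdn" "$@"
}

# Example usage on the primary appliance (failed-node-fqdn is a placeholder):
#   preview_fqdn_lines failed-node-fqdn \
#       /etc/haproxy/conf.d/10-psql.cfg /etc/haproxy/conf.d/20-vcac.cfg
```

The helper returns a non-zero status when nothing matches. Note that sed treats each dot in the FQDN as a regular-expression wildcard, so in principle it can match more lines than this literal preview shows.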

Log in to the vRA UI with a user from the vsphere.local domain who has Tenant Admin permissions on each tenant. This is needed to verify step 11.
If any directories still have a connector pointing to the failing node (that is, they were not updated during step 1), delete and re-create them.
In an SSH or console session on the primary vRealize Automation appliance, run these commands:

echo "Delete from \"saas\".\"Connector\" where host like '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

echo "Delete from \"saas\".\"OAuth2Client\" where \"OAuth2Client\".\"redirectUri\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

echo "Delete from \"saas\".\"FederationArtifacts\" where \"FederationArtifacts\".\"strData\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

echo "Delete from \"saas\".\"ServiceInstance\" where \"ServiceInstance\".\"hostName\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

The value of failed-node-fqdn is the FQDN of the failed vRealize Automation appliance.

Note: Some of the above commands may print a result DELETE 0 depending on the current configuration.
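
To avoid retyping the FQDN in each statement, the four DELETE statements above can be generated from a single argument. The following is a minimal sketch. The function name saas_cleanup_sql and the character check on the FQDN are assumptions added for illustration, and the statements themselves are copied from this step.

```shell
# Hypothetical helper: print the four cleanup statements for one FQDN.
# It only prints SQL; nothing is executed until the output is piped to psql.
saas_cleanup_sql() {
    fqdn="$1"
    # Refuse empty input or characters that do not belong in a host name,
    # so that the value cannot break out of the quoted SQL pattern.
    case "$fqdn" in
        ''|*[!A-Za-z0-9.-]*) echo "invalid FQDN: $fqdn" >&2; return 1 ;;
    esac
    cat <<EOF
Delete from "saas"."Connector" where host like '%${fqdn}%';
Delete from "saas"."OAuth2Client" where "OAuth2Client"."redirectUri" LIKE '%${fqdn}%';
Delete from "saas"."FederationArtifacts" where "FederationArtifacts"."strData" LIKE '%${fqdn}%';
Delete from "saas"."ServiceInstance" where "ServiceInstance"."hostName" LIKE '%${fqdn}%';
EOF
}
```

Review the printed statements before piping them into the same psql invocation used in the commands above.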

In an SSH or console session on the primary vRealize Automation appliance, run the command:

service elasticsearch restart

If the command curl -XGET 'http://localhost:9200/_nodes', executed on the current primary vRA node, still returns the following error:

"error" : "MasterNotDiscoveredException{waited for {30s}}", "status" : "503"

then, in an SSH or console session on the primary vRealize Automation appliance, run the command:

echo "Select * from \"saas\".\"ServiceInstance\" ;" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

The result should not contain any records where hostName is failed-node-fqdn.

The result should not contain more than one record where hostName is primary-node-fqdn. If duplicates exist for the primary node, keep the record with the most recent createDate value and delete the others:

echo "Delete from \"saas\".\"ServiceInstance\" where \"ServiceInstance\".\"id\" = <idNum>;" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

The value of <idNum> is the id of the record to delete.

Validate that the deleted node was not the primary synchronization connector by confirming that a remaining connector has "isDirectorySyncEnabled" set to true or "t":

select * from "saas"."Connector";

Note: For 3-node clusters, ensure that one connector has isDirectorySyncEnabled set to true.

If the value for the remaining connector is false or "f", run the following query, replacing connector_name with the name of that connector:

echo "update \"saas\".\"Connector\" set \"isDirectorySyncEnabled\" = 't' where \"name\" = 'connector_name';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

(Optional) If you are using embedded vRO (version >= 7.1), access the advanced Orchestrator Cluster Management page in Control Center, at https://your_orchestrator_server_IP_or_DNS_name:8283/vco-controlcenter/#/control-app/ha?remove-nodes to remove the leftover records.

System health verification:

Verify all services in the VAMI are registered: https://<vra_appliance_node_fqdn>:5480
Verify rabbitmq service status: service rabbitmq-server status.
Verify that the 'rabbitmqctl cluster_status' result does not include the failed node.
Verify elasticsearch cluster status: the output of curl -XGET 'http://localhost:9200/_nodes' lists no repeated nodes and does not include the removed node.
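
The last two checks can be partly automated by searching the command output for the removed node. The following is a minimal sketch. The function name output_lacks is a hypothetical helper, and it performs a plain substring search rather than parsing the output.

```shell
# Hypothetical helper: read command output on stdin and succeed only if the
# given name does not appear in it as a literal string.
output_lacks() {
    name="$1"
    if grep -qF -- "$name"; then
        echo "still present: $name" >&2
        return 1
    fi
    echo "not present: $name"
}

# Example usage on the primary node (the names are placeholders):
#   rabbitmqctl cluster_status | output_lacks rabbit@node-short-domain-name
#   curl -s -XGET 'http://localhost:9200/_nodes' | output_lacks failed-node-fqdn
```

Because the match is a substring search, a name that is a prefix of another node's name (for example node1 and node10) reports a false "still present". This check does not detect repeated nodes, so the elasticsearch output still needs a manual look for duplicates.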