vPostgres service fails to start on vCenter Server due to excessive entries in TRUSTED_ROOT_CRLS

Products

VMware vCenter Server

Issue/Introduction

Symptoms:

Service vmware-vpostgres fails to start on vCenter Server.
Most other services, such as vmware-vpxd-svcs and vmware-vpxd, also fail to start. For more information about vCenter services, see Stopping, Starting or Restarting VMware vCenter Server Appliance 6.x & above services.
Connections to the vCenter Server database (VCDB) fail with the error below. For more information about VCDB, see Interacting with the vCenter Server Appliance 6.5/6.7/7.0 embedded vPostgres Database.

Failed to connect to database: ODBC error: (08001) - [unixODBC]could not connect to server: Connection refused
--> Is the server running on host "localhost" (127.0.0.1) and accepting
--> TCP/IP connections on port 5432?

vPostgres logs are not updated with any events.

Note: The vPostgres logs are located at /var/log/vmware/vpostgres/postgresql-xx.log

In vpxd.log, you may see entries similar to:

2020-07-07T20:18:01.671Z error vpxd[35339] [Originator@6876 sub=vpxdVdb] [VpxdVdb::SetDBType] Failed to connect to database: ODBC error: (08001) - [unixODBC]could not connect to server: Connection refused
-->     Is the server running on host "localhost" (127.0.0.1) and accepting
-->     TCP/IP connections on port 5432?
-->    Retry attempt: 16305 ...

Note: The vpxd.log is located at /var/log/vmware/vpxd/vpxd.log

vmon-syslog.log doesn't indicate why vmware-vpostgres is not starting.

2020-07-07T20:31:03.805884+00:00 notice vmon  Received start request for vmware-vpostgres
2020-07-07T20:31:03.806089+00:00 notice vmon  <vmware-vpostgres-prestart> Constructed command: /opt/vmware/vpostgres/current/scripts/pg_pre_start
|
|
<vmware-vpostgres-prestart> Constructed command: /opt/vmware/vpostgres/current/scripts/pg_pre_start
2020-07-07T20:33:03.040400+00:00 notice vmon  Executing service batch op API_HEALTH. IgnoreFail=1, service count=10
2020-07-07T20:33:03.040808+00:00 notice vmon  <vapi-endpoint-healthcmd> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonApiHealthCmd.py -n vapi-endpoint -u /vapiendpoint/health -t 30
2020-07-07T20:33:03.041005+00:00 notice vmon  <rhttpproxy-healthcmd> Constructed command: /usr/bin/python /usr/lib/vmware-rhttpproxy/rhttpproxy-vmon-apihealth.py
2020-07-07T20:33:03.041184+00:00 notice vmon  <vmware-vpostgres> Skip service health check. State STOPPED, Curr request 1
2020-07-07T20:33:03.041356+00:00 notice vmon  <vcha> Skip service health check. State STOPPED, Curr request 0
2020-07-07T20:33:03.041535+00:00 notice vmon  <vmware-postgres-archiver> Skip service health check. State STOPPED, Curr request 0
2020-07-07T20:33:03.041711+00:00 notice vmon  <vpxd-svcs> Skip service health check. State STOPPED, Curr request 0
2020-07-07T20:33:03.041882+00:00 notice vmon  <vpxd> Skip service health check. State STOPPING, Curr request 1
2020-07-07T20:33:03.042051+00:00 notice vmon  <sps> Skip service health check. State STOPPED, Curr request 0
2020-07-07T20:33:03.042221+00:00 notice vmon  <rbd> Skip service health check. State STOPPED, Curr request 0
2020-07-07T20:33:03.042407+00:00 notice vmon  <pschealth> Skip service health check. State STOPPED, Curr request 0
2020-07-07T20:33:03.354545+00:00 notice vmon  Successfully executed service batch operation API_HEALTH.

Note: The vmon-syslog.log is located at /var/log/vmware/vmon/vmon-syslog.log

In vpxd-svcs.log, you may see the error below:

SQL Error: org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.)

Note: The vpxd-svcs.log is located at /var/log/vmware/vpxd-svcs/vpxd-svcs.log
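
To check the logs listed above for these signatures in one pass, a sketch like the following may help. The function name has_db_refused is hypothetical; the log paths and message text are the ones quoted in this article.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given log contains one of the
# "Connection refused" database errors quoted in this article.
has_db_refused() {
    local logfile=$1
    [ -r "$logfile" ] && grep -q \
        -e 'could not connect to server: Connection refused' \
        -e 'Cannot create PoolableConnectionFactory (Connection refused' \
        "$logfile"
}

# Scan the two service logs named above; unreadable or missing files are skipped.
for log in /var/log/vmware/vpxd/vpxd.log /var/log/vmware/vpxd-svcs/vpxd-svcs.log; do
    if has_db_refused "$log"; then
        echo "$log: database connection refused"
    fi
done
```

A match only confirms the symptom. The cause still has to be confirmed with the vecs-cli check in the Cause section.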

Environment

VMware vCenter Server Appliance 6.5.x
VMware vCenter Server 7.0.x
VMware vCenter Server Appliance 6.0.x
VMware vCenter Server Appliance 6.7.x

Cause

This issue is caused by corrupted certificates under /etc/ssl/certs, which result in an unexpectedly high number of entries in the TRUSTED_ROOT_CRLS store.

To confirm the cause, run the command below on the VCSA. If you are using an external PSC, run it on both the vCenter Server and the PSC:
# /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOT_CRLS | grep Number

The output should look similar to:
Number of entries in store : 3165

Notes:

If the command reports a large number of entries (hundreds or thousands), proceed with the resolution in this article.
With an external Platform Services Controller, run the command on both the Platform Services Controller and the vCenter Server.
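
The check above can be wrapped in a small script. The following is a minimal sketch and is not part of the attached fix. The function name crl_entry_count and the threshold of 100 are illustrative assumptions, since this article only describes "hundreds or thousands" as too many. The script parses the "Number of entries in store" line printed by vecs-cli.

```shell
#!/bin/bash
# Hypothetical helper: read "vecs-cli entry list" output on stdin and
# print the number from the "Number of entries in store" line.
crl_entry_count() {
    awk -F: '/Number of entries in store/ { gsub(/[ \t\r]/, "", $NF); print $NF; exit }'
}

# Assumed threshold; the article does not define an exact limit.
CRL_THRESHOLD=100

VECS_CLI=/usr/lib/vmware-vmafd/bin/vecs-cli
if [ -x "$VECS_CLI" ]; then
    count=$("$VECS_CLI" entry list --store TRUSTED_ROOT_CRLS | crl_entry_count)
    if [ "${count:-0}" -gt "$CRL_THRESHOLD" ]; then
        echo "TRUSTED_ROOT_CRLS has $count entries: proceed with the resolution."
    else
        echo "TRUSTED_ROOT_CRLS has ${count:-0} entries: this article likely does not apply."
    fi
fi
```

On a system without vecs-cli the script prints nothing, so it is safe to run on the wrong machine by mistake.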

Resolution

To resolve this issue, remove the extra entries from the TRUSTED_ROOT_CRLS store by following the steps below:

Take an offline snapshot of the VCSA virtual machine (and of the Platform Services Controller virtual machine if you use an external PSC).

Caution: Do NOT skip this step.

Connect to the VCSA (and the external PSC, if you are using one) through SSH.
Download the crl-fix.sh script attached to this article and upload it to the /tmp directory on the impacted VCSA (and on the external Platform Services Controller, if applicable) using WinSCP. Alternatively, copy its contents into a file on the appliance using the vi editor.

Note: If you get one of the errors below while connecting to the appliance with WinSCP, run the following command to set the root shell to bash. For more information, see Error when uploading files to vCenter Server Appliance using WinSCP (2107727).
# chsh -s /bin/bash root

Host is not communicating for more than 15 seconds. If the problem repeats, try turning off 'Optimize connection buffer size'.
or
Cannot initialize SFTP protocol. Is the host running an SFTP server?

Browse to the /tmp directory.

# cd /tmp

Run the below command to make the file executable.

# chmod +x crl-fix.sh

Run the crl-fix.sh script.

# ./crl-fix.sh

Note: If you get the error below while running the script:
bash: ./crl-fix.sh: /bin/bash^M: bad interpreter: No such file or directory

This error is caused by DOS carriage returns added to the script when copying from a Windows-based text editor. To resolve this problem, run this command and rerun the script:

# sed -i -e 's/\r$//' crl-fix.sh
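
To check for carriage returns before running the script, a sketch like the following could be used. The function name strip_crlf_if_needed is an assumption; the sed expression is the one shown above.

```shell
#!/bin/bash
# Hypothetical helper: strip trailing DOS carriage returns from a file,
# but only if at least one line actually ends with one.
strip_crlf_if_needed() {
    local file=$1
    local cr
    cr=$(printf '\r')
    if grep -q "${cr}\$" "$file"; then
        sed -i -e 's/\r$//' "$file"
        echo "Removed trailing carriage returns from $file"
    else
        echo "No trailing carriage returns found in $file"
    fi
}

# Example call (uncomment on the appliance):
# strip_crlf_if_needed /tmp/crl-fix.sh
```

Checking first leaves a clean script untouched, so its modification time does not change unnecessarily.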

Notes:

The script may take some time before showing any progress depending on the number of entries in the TRUSTED_ROOT_CRLS store.
When the script completes, it stops the vmafdd service and starts it again.

Restart all services on the VCSA and/or the external PSC:

# service-control --stop --all
# service-control --start --all
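
After the services restart, one way to confirm that vPostgres is accepting connections again is to probe the address and port named in the original error (127.0.0.1:5432). This is a minimal sketch using bash's built-in /dev/tcp redirection. The function name port_open is hypothetical, and a successful TCP connect shows only that something is listening on the port, not that the database is healthy.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp redirection; the subshell closes the socket on exit.
port_open() {
    local host=$1 port=$2
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if port_open 127.0.0.1 5432; then
    echo "Port 5432 on 127.0.0.1 is accepting connections"
else
    echo "Nothing is accepting connections on 127.0.0.1:5432"
fi
```

If the port is still closed after the restart, re-run the vecs-cli count from the Cause section to confirm that the TRUSTED_ROOT_CRLS store was actually cleaned up.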

Attachments

crl-fix