TrimmedException leads to missing configuration information in NSX 4.1.0

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
- NSX version is 4.1.0
- Creation of some NSX configuration objects fails, for example Groups, Logical Switches, and DFW Rules.
- Existing configuration is missing from the UI and API.

The following three log signatures are seen when this issue occurs:

1. "TrimmedException" messages are seen in NSX Manager proton logs.
Example in /var/log/proton/nsxapi.log:

2023-04-24T12:44:30.970Z WARN http-nio-127.0.0.1-7440-exec-25 AbstractQueuedStreamView 4840 Fill_Read_Queue[1a2@-1] Trim encountered.
org.corfudb.runtime.exceptions.TrimmedException: Trimmed address: 661313                                                         <<<<
        at org.corfudb.runtime.view.AddressSpaceView.isLogDataValid(AddressSpaceView.java:789) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.checkLogDataThrowException(AddressSpaceView.java:816) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.fetch(AddressSpaceView.java:810) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.lambda$read$9(AddressSpaceView.java:367) ~[?:?]
        at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:57) ~[?:?]
        at org.corfudb.common.metrics.micrometer.MicroMeterUtils.lambda$time$6(MicroMeterUtils.java:121) ~[?:?]
        at java.util.Optional.map(Optional.java:215) ~[?:1.8.0_362]

2. "UnreachableClusterException" or "UnrecoverableCorfuInterruptedError" or "UnrecoverableCorfuError" errors are observed in the Corfu logs. These exceptions are preceded by a corfu-server start/restart.
Example in /var/log/corfu/corfu.9000.log:

2023-04-24T12:01:43.022Z | ERROR |       Cmpt-9000-chkpter | o.c.r.o.MVOCorfuCompileProxy | abortTransaction[ImmutableCorfuTable[f2a]] Abort Transaction with Exception {}
org.corfudb.runtime.exceptions.UnreachableClusterException: Runtime stalled. Invoking systemDownHandler after 60 unsuccessful tries.
    at org.corfudb.infrastructure.ManagementServer.lambda$new$0(ManagementServer.java:99)
    at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:176)
    at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:61)
    at org.corfudb.runtime.view.AddressSpaceView.fetchAll(AddressSpaceView.java:744)
    at org.corfudb.runtime.view.AddressSpaceView.lambda$read$13(AddressSpaceView.java:489)

3. The Corfu compactor leader logs on one of the Managers indicate a reduction in the number of tables that were checkpointed, and the timestamp of the reduction is close to the TrimmedException timestamps.

Example: less corfu-compactor-leader.1.log.gz | grep "Total time taken for the compaction cycle"

2023-04-24T10:50:27.002Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 488499ms for 997 tables with status COMPLETED
.
.
2023-04-24T12:24:14.552Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 561543ms for 949 tables with status COMPLETED       <----- Notice that the # of tables got reduced
2023-04-24T12:37:15.632Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 458044ms for 949 tables with status COMPLETED
2023-04-24T12:52:58.115Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 498881ms for 949 tables with status COMPLETED
2023-04-24T13:07:54.492Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 492395ms for 949 tables with status COMPLETED
2023-04-24T13:21:37.181Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 410507ms for 949 tables with status COMPLETED
2023-04-24T13:36:39.089Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 409029ms for 949 tables with status COMPLETED
2023-04-24T13:51:27.566Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 395965ms for 949 tables with status COMPLETED
2023-04-24T14:32:10.576Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 567154ms for 997 tables with status COMPLETED
2023-04-24T14:49:27.768Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 510723ms for 997 tables with status COMPLETED
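
The third signature can be checked programmatically. The following is a minimal sketch, assuming the compactor leader log lines have exactly the format shown above; the function name find_table_drops is illustrative and not part of any NSX tooling.

```python
import re

# Matches the summary line written at the end of each compaction cycle,
# based only on the sample log lines shown in this article.
CYCLE_RE = re.compile(
    r"^(?P<ts>\S+) \|.*Total time taken for the compaction cycle: "
    r"(?P<ms>\d+)ms for (?P<tables>\d+) tables with status (?P<status>\w+)"
)


def find_table_drops(lines):
    """Return (timestamp, previous_count, new_count) for each cycle
    whose checkpointed table count is lower than the previous cycle's."""
    drops = []
    previous = None
    for line in lines:
        match = CYCLE_RE.search(line)
        if not match:
            continue  # not a compaction summary line
        tables = int(match.group("tables"))
        if previous is not None and tables < previous:
            drops.append((match.group("ts"), previous, tables))
        previous = tables
    return drops
```

For a rotated log, the lines can be read with gzip.open(path, "rt") and passed straight to find_table_drops. Any timestamp it returns can then be compared against the TrimmedException timestamps in nsxapi.log.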

Cause

NSX initializes many processes when the Corfu server starts or restarts, one of which is the compactor process. During compactor initialization, the first table open accesses the registry table, which syncs the contents of the registry table into the Corfu object layer. If an UnreachableClusterException is encountered while this sync is in progress, the compactor does not handle the exception and ignores it.
This leaves the registry table in the object layer in an inconsistent state with the database.

For example, the registry table on disk lists N tables, but the Corfu object layer holds only N-48 tables because the sync was interrupted by an UnreachableClusterException. When the compactor reads the list of tables to compact, it sees only N-48 tables, which results in data loss for the other 48 tables. Every subsequent compactor cycle processes only N-48 tables until the next restart of the Corfu server, when the sync succeeds and all N tables are visible again. From then on the checkpoint works correctly for all N tables, but the 48 affected tables no longer have any data in them.
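
The effect described above can be illustrated with a toy model. This is not Corfu code: the dictionary-based log, the table names, and the compaction_cycle helper are all hypothetical. It assumes, in line with the TrimmedException signature, that only checkpointed data survives once the log is trimmed.

```python
def compaction_cycle(log, visible_tables):
    """Toy model of one compaction cycle.

    log:            dict mapping table name -> list of entries (the data on disk)
    visible_tables: table names the compactor can see in the object layer

    Only visible tables are checkpointed. After the trim, a table keeps
    its entries only if it was checkpointed (assumption of this sketch).
    """
    checkpoint = {
        name: list(entries)
        for name, entries in log.items()
        if name in visible_tables
    }
    # Tables still exist after the trim, but unseen ones come back empty.
    return {name: checkpoint.get(name, []) for name in log}


if __name__ == "__main__":
    n = 997  # table count taken from the sample log above
    on_disk = {f"table_{i}": [f"entry_{i}"] for i in range(n)}
    # Incomplete registry sync: the object layer sees only N-48 tables.
    visible = set(list(on_disk)[: n - 48])
    after = compaction_cycle(on_disk, visible)
    empty = [name for name, entries in after.items() if not entries]
    print(len(after), "tables remain;", len(empty), "are empty")
```

The model only captures the bookkeeping: the number of tables is unchanged, but the tables the compactor never saw end up empty.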

Resolution

This issue is resolved in NSX 4.1.0.2 and higher versions, available at VMware Downloads.

Please be advised that the VMware NSX team has decided to withdraw the NSX 4.1.0 release from the download page in favor of NSX 4.1.0.2.
Customers who have downloaded and deployed NSX 4.1.0 remain supported but are strongly advised to upgrade to NSX 4.1.0.2 or higher at their earliest convenience.

Workaround:

Recovery
If this issue has been experienced, it is necessary to restore the NSX Manager from backup.

Prevention
To prevent this issue from occurring, VMware recommends an upgrade to NSX 4.1.0.2.

If an upgrade is not possible, the following steps can be used to prevent the issue from occurring in a 4.1.0 environment.

Apply the workaround on all three NSX Managers sequentially. Wait for the Manager cluster to stabilize before applying the workaround on the next Manager.

Step 1. Download the Debian package
Link: https://ftpsite.vmware.com/download?domain=FTPSITE&id=61e3498f56fe4db6b45b957b38cd0f0a-1c2f44d7feae46788cd2ce930ed81cc0

Step 2. Copy the deb package to the /image/ directory using WinSCP or another file transfer tool, and validate the MD5 checksum
root@nsxmanager:~# md5sum /image/nsx-corfu-server_4.1.20230509211150.7973.1_all.deb
b665c2c58df41645bc4468b16d22e5fb nsx-corfu-server_4.1.20230509211150.7973.1_all.deb
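
Where the package is staged on a workstation before the copy, the same checksum comparison can be scripted. This is a minimal sketch using the Python standard library; the helper names md5_of_file and checksum_matches are illustrative, and the expected value is the checksum published in Step 2.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare the file's MD5 digest with the published value."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Path and checksum as given in Step 2 of this article.
    package = "/image/nsx-corfu-server_4.1.20230509211150.7973.1_all.deb"
    expected = "b665c2c58df41645bc4468b16d22e5fb"
    print("checksum OK" if checksum_matches(package, expected) else "checksum MISMATCH")
```

If the checksum does not match, download the package again before proceeding to Step 3.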

Step 3. Install the Debian package
root@nsxm-PR3182420:~# dpkg -i /image/nsx-corfu-server_4.1.20230509211150.7973.1_all.deb
(Reading database ... 61995 files and directories currently installed.)
Preparing to unpack nsx-corfu-server_4.1.20230509211150.7973.1_all.deb ...
Synchronizing state of corfu-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable corfu-server
Removed /etc/systemd/system/nsx-custom.target.wants/corfu-server.service.
Synchronizing state of corfu-nonconfig-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable corfu-nonconfig-server
Removed /etc/systemd/system/nsx-custom.target.wants/corfu-nonconfig-server.service.
Software Integrity Check is not Enabled
Unpacking nsx-corfu-server (4.1.20230509211150.7973.1) over (4.1.20230215040838.1096.1) ...
Setting up nsx-corfu-server (4.1.20230509211150.7973.1) ...
Synchronizing state of corfu-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable corfu-server
Created symlink /etc/systemd/system/nsx-custom.target.wants/corfu-server.service -> /lib/systemd/system/corfu-server.service.
Synchronizing state of corfu-nonconfig-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable corfu-nonconfig-server
Created symlink /etc/systemd/system/nsx-custom.target.wants/corfu-nonconfig-server.service -> /lib/systemd/system/corfu-nonconfig-server.service.
Synchronizing state of corfu-log-replication-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable corfu-log-replication-server
Created symlink /etc/systemd/system/nsx-custom.target.wants/corfu-log-replication-server.service -> /lib/systemd/system/corfu-log-replication-server.service.
Starting corfu-server
Starting corfu-nonconfig-server
Processing triggers for systemd (245.4-4ubuntu3.13) ...

Step 4. Check the most recent version recorded in the Tanuki log. Version (a106826) confirms that the package is installed

root@nsxmanager:~# grep Version /var/log/corfu/tanuki.log
INFO | jvm 1 | 2023/05/23 11:28:51 | Version (3fb572f)
INFO | jvm 1 | 2023/05/23 17:02:02 | Version (a106826) <<<<<<<<<<<<< a106826 confirms the installation
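
The check in Step 4 amounts to reading the last Version entry in the log. A minimal sketch, assuming the tanuki.log line format shown above; latest_version is an illustrative helper name.

```python
import re

# Matches entries such as "Version (a106826)" from the sample output above.
VERSION_RE = re.compile(r"Version \((?P<build>[0-9a-f]+)\)")


def latest_version(lines):
    """Return the build identifier from the last Version line, or None."""
    latest = None
    for line in lines:
        match = VERSION_RE.search(line)
        if match:
            latest = match.group("build")
    return latest


if __name__ == "__main__":
    with open("/var/log/corfu/tanuki.log", encoding="utf-8", errors="replace") as log:
        build = latest_version(log)
    print("patched" if build == "a106826" else f"unexpected version: {build}")
```

Earlier Version entries remain in the log after the installation, so only the last one is relevant.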

Step 5. Wait for the Manager cluster to stabilize, then repeat steps 2, 3, and 4 on the other two Managers

root@nsxmanager:~# su admin -c "get cluster status"
Tue May 23 2023 UTC 17:09:20.067
Cluster Id: ef688269-9d2e-4345-ba51-203feff85e46
Overall Status: STABLE

Group Type: DATASTORE
Group Status: STABLE

Members:
UUID FQDN IP IPv6 STATUS
7e4f2842-xxxx-xxxx-xxxx-3488xxxxxx nsxm-xxxxx X.X.X.1 - UP
8c4f3842-xxxx-xxxx-xxxx-3488xxxxxx nsxm-xxxxx X.X.X.2 - UP
9c5c1234-xxxx-xxxx-xxxx-3488xxxxxx nsxm-xxxxx X.X.X.3 - UP
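
When applying the workaround across several Managers, the stability check can be automated by parsing captured "get cluster status" output. The sketch below assumes the text layout shown above, where the last column of each member row is its status; cluster_is_stable is an illustrative name, not an NSX API.

```python
def cluster_is_stable(output):
    """Return True if Overall Status is STABLE and every listed member is UP."""
    overall = None
    member_states = []
    in_members = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Overall Status:"):
            overall = line.split(":", 1)[1].strip()
        elif line.startswith("Group Type:"):
            in_members = False  # a new group section begins
        elif line.startswith("Members:"):
            in_members = True
        elif in_members and line and not line.startswith("UUID"):
            # Member row: the last whitespace-separated field is the status.
            member_states.append(line.split()[-1])
        elif in_members and not line:
            in_members = False  # a blank line ends the member table
    return (
        overall == "STABLE"
        and bool(member_states)
        and all(state == "UP" for state in member_states)
    )
```

A result of False means the cluster should be given more time, or investigated, before the package is installed on the next Manager.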

** NOTE: If a Manager node is redeployed while the environment is still on 4.1.0, the workaround must be reapplied.

Additional Information

Impact/Risks:
Some NSX configurations may get deleted.