TrimmedException leads to missing configuration information in NSX 4.1.0

Article ID: 317745

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
- NSX version is 4.1.0
- Creation of some NSX configuration objects fails, for example Groups, Logical Switches, and DFW Rules.
- Existing configuration is missing from the UI and the API.
 
The following three log signatures are seen when this issue occurs:
 
1. "TrimmedException" messages are seen in NSX Manager proton logs.
Example in /var/log/proton/nsxapi.log:
 
2023-04-24T12:44:30.970Z  WARN http-nio-127.0.0.1-7440-exec-25 AbstractQueuedStreamView 4840 Fill_Read_Queue[1a2@-1] Trim encountered.
org.corfudb.runtime.exceptions.TrimmedException: Trimmed address: 661313                                                         <<<<
        at org.corfudb.runtime.view.AddressSpaceView.isLogDataValid(AddressSpaceView.java:789) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.checkLogDataThrowException(AddressSpaceView.java:816) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.fetch(AddressSpaceView.java:810) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.lambda$read$9(AddressSpaceView.java:367) ~[?:?]
        at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:57) ~[?:?]
        at org.corfudb.common.metrics.micrometer.MicroMeterUtils.lambda$time$6(MicroMeterUtils.java:121) ~[?:?]
        at java.util.Optional.map(Optional.java:215) ~[?:1.8.0_362]
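
To locate this signature across a large nsxapi.log, the lines can be scanned programmatically. The following is a minimal sketch, assuming the log keeps the format shown in the example above; the function name find_trimmed_addresses is illustrative and not part of NSX.

```python
import re

# Matches the exception line shown above, e.g.
# "org.corfudb.runtime.exceptions.TrimmedException: Trimmed address: 661313"
TRIMMED_RE = re.compile(r"TrimmedException: Trimmed address: (\d+)")
# Leading timestamp of a log record, e.g. "2023-04-24T12:44:30.970Z"
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)")


def find_trimmed_addresses(lines):
    """Return (timestamp, address) pairs, one per TrimmedException line.

    The exception is logged on the line after the WARN record, so the
    timestamp is taken from the most recent timestamped line.
    """
    results = []
    last_timestamp = None
    for line in lines:
        ts = TIMESTAMP_RE.match(line)
        if ts:
            last_timestamp = ts.group(1)
        hit = TRIMMED_RE.search(line)
        if hit:
            results.append((last_timestamp, int(hit.group(1))))
    return results
```

The function accepts any iterable of lines, for example an open file handle on /var/log/proton/nsxapi.log. The returned timestamps can then be compared with the compactor leader timestamps described in signature 3.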
 
2.  "UnreachableClusterException" or "UnrecoverableCorfuInterruptedError" or "UnrecoverableCorfuError" errors are observed in the Corfu logs.  These exceptions are preceded by a corfu-server start/restart.
Example in /var/log/corfu/corfu.9000.log:
 
2023-04-24T12:01:43.022Z | ERROR |       Cmpt-9000-chkpter |  o.c.r.o.MVOCorfuCompileProxy | abortTransaction[ImmutableCorfuTable[f2a]] Abort Transaction with Exception {}
org.corfudb.runtime.exceptions.UnreachableClusterException: Runtime stalled. Invoking systemDownHandler after 60 unsuccessful tries.
    at org.corfudb.infrastructure.ManagementServer.lambda$new$0(ManagementServer.java:99)
    at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:176)
    at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:61)
    at org.corfudb.runtime.view.AddressSpaceView.fetchAll(AddressSpaceView.java:744)
    at org.corfudb.runtime.view.AddressSpaceView.lambda$read$13(AddressSpaceView.java:489)
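
The three exception names listed in signature 2 can be counted in one pass over the Corfu log. This is a minimal sketch that only does substring matching on the names given above; count_signatures is an illustrative helper name, not an NSX tool.

```python
from collections import Counter

# Exception names taken from signature 2 above.
SIGNATURES = (
    "UnreachableClusterException",
    "UnrecoverableCorfuInterruptedError",
    "UnrecoverableCorfuError",
)


def count_signatures(lines):
    """Count how many lines mention each of the three exception names."""
    counts = Counter()
    for line in lines:
        for name in SIGNATURES:
            # None of the names is a substring of another, so a line is
            # never counted twice for a single occurrence.
            if name in line:
                counts[name] += 1
    return counts
```

A non-zero count alone does not confirm this issue. Per the description above, the exceptions are relevant when they follow a corfu-server start or restart.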
 
3.  The Corfu compactor leader logs on one of the Managers indicate a reduction in the number of tables that were checkpointed, and the timestamp of the reduction is close to the TrimmedException timestamps.
 
Example: less corfu-compactor-leader.1.log.gz | grep "Total time taken for the compaction cycle"
 
2023-04-24T10:50:27.002Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 488499ms for 997 tables with status COMPLETED
.
.
2023-04-24T12:24:14.552Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 561543ms for 949 tables with status COMPLETED       <----- Notice that the # of tables got reduced
2023-04-24T12:37:15.632Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 458044ms for 949 tables with status COMPLETED
2023-04-24T12:52:58.115Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 498881ms for 949 tables with status COMPLETED
2023-04-24T13:07:54.492Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 492395ms for 949 tables with status COMPLETED
2023-04-24T13:21:37.181Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 410507ms for 949 tables with status COMPLETED
2023-04-24T13:36:39.089Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 409029ms for 949 tables with status COMPLETED
2023-04-24T13:51:27.566Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 395965ms for 949 tables with status COMPLETED
2023-04-24T14:32:10.576Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 567154ms for 997 tables with status COMPLETED
2023-04-24T14:49:27.768Z | INFO  |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 510723ms for 997 tables with status COMPLETED
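
Spotting the reduction by eye is error-prone when there are many cycles. The following is a minimal sketch that parses the "Total time taken for the compaction cycle" lines and reports every cycle whose table count is lower than the previous one. It assumes the line format shown above; parse_cycles and find_table_drops are illustrative names.

```python
import re

# Captures timestamp, duration in ms, table count and status from a
# "Total time taken for the compaction cycle" line.
CYCLE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T[\d:.]+Z) \|.*?"
    r"Total time taken for the compaction cycle: "
    r"(\d+)ms for (\d+) tables with status (\w+)"
)


def parse_cycles(lines):
    """Return (timestamp, duration_ms, table_count, status) per cycle line."""
    cycles = []
    for line in lines:
        m = CYCLE_RE.search(line)
        if m:
            cycles.append((m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)))
    return cycles


def find_table_drops(cycles):
    """Return (timestamp, previous_count, new_count) for each reduction."""
    drops = []
    for previous, current in zip(cycles, cycles[1:]):
        if current[2] < previous[2]:
            drops.append((current[0], previous[2], current[2]))
    return drops
```

Applied to the example above, the only drop reported is the 12:24:14.552Z cycle, where the count falls from 997 to 949.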
 

Cause

NSX initializes many processes at the start or restart of the Corfu server, one of which is the compactor process. While compactor initialization is in progress, the first table open accesses the registry table. This action syncs the contents of the registry table into the Corfu object layer. If an UnreachableClusterException is encountered while the sync is in progress, the compactor does not handle the exception and ignores it.
This leaves the registry table in the object layer in an inconsistent state with the database.
 
For example, on disk the registry table lists N tables, but the Corfu object layer holds only N-48 tables because the sync was cut short by an UnreachableClusterException. When the compactor then reads the list of tables to compact, it sees only N-48 tables because the object layer has inconsistent data, and the remaining 48 tables lose their data. From this point, every compactor cycle sees only N-48 tables until the next restart of the Corfu server, when the sync succeeds and all N tables are visible again. After that restart the checkpoint works correctly for all N tables, but those 48 tables no longer contain any data.
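
The effect can be illustrated with a toy model. This is not Corfu code. It assumes, as a simplification not spelled out in this article, that a compaction cycle preserves the data of every table the compactor can see and trims the log entries of every table it cannot see. The counts 997 and 949 are taken from the compactor leader example above, and all names are hypothetical.

```python
def run_compaction_cycle(on_disk_tables, visible_tables):
    """Toy model of one compaction cycle with an incomplete registry view.

    on_disk_tables: dict mapping table name -> list of entries on disk
    visible_tables: set of table names the object layer knows about
    Returns the table contents that remain after the cycle.
    """
    remaining = {}
    for name, entries in on_disk_tables.items():
        if name in visible_tables:
            remaining[name] = list(entries)  # checkpointed, data preserved
        else:
            remaining[name] = []             # not checkpointed, data trimmed
    return remaining


# N = 997 tables on disk, but only N - 48 = 949 made it into the object layer.
on_disk = {f"table_{i}": [f"entry_{i}"] for i in range(997)}
visible = {f"table_{i}" for i in range(949)}

after_cycle = run_compaction_cycle(on_disk, visible)
empty_tables = [name for name, entries in after_cycle.items() if not entries]
print(len(after_cycle), len(empty_tables))  # prints "997 48"
```

All 997 tables still exist after the cycle, which matches the observation that later checkpoints cover N tables again, but 48 of them are empty.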

Resolution

This issue is resolved in NSX 4.1.0.2 and later versions, available at VMware Downloads.

Please be advised that the VMware NSX team has decided to withdraw the NSX 4.1.0 release from the download page in favor of NSX 4.1.0.2. 
Customers who have downloaded and deployed NSX 4.1.0 remain supported but are strongly advised to upgrade to NSX 4.1.0.2 or higher at their earliest convenience.


Workaround:

Recovery
If this issue has been experienced, it is necessary to restore the NSX Manager from backup.

Prevention
To prevent this issue from occurring, VMware recommends an upgrade to NSX 4.1.0.2.

If an upgrade is not possible, the following steps can be used to prevent the issue from occurring in a 4.1.0 environment.

Apply the workaround on all three NSX Managers sequentially. Wait for the Manager cluster to stabilize before applying the workaround on the next Manager.

Step 1. Download the Debian package
    Link: https://ftpsite.vmware.com/download?domain=FTPSITE&id=61e3498f56fe4db6b45b957b38cd0f0a-1c2f44d7feae46788cd2ce930ed81cc0

Step 2.  Move the deb package to the /image/ directory (via WinSCP or other means) and validate the md5sum
        root@nsxmanager:~# md5sum /image/nsx-corfu-server_4.1.20230509211150.7973.1_all.deb
        b665c2c58df41645bc4468b16d22e5fb  nsx-corfu-server_4.1.20230509211150.7973.1_all.deb
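
Where the checksum needs to be verified from a script rather than by comparing the md5sum output by eye, the check can be sketched as follows. The expected value is the one shown in Step 2; the helper names md5_of_file and package_is_intact are illustrative.

```python
import hashlib

# MD5 value published in Step 2 of this article.
EXPECTED_MD5 = "b665c2c58df41645bc4468b16d22e5fb"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, package_is_intact("/image/nsx-corfu-server_4.1.20230509211150.7973.1_all.deb") should return True before proceeding to Step 3.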

Step 3. Install the Debian package
        root@nsxm-PR3182420:~# dpkg -i /image/nsx-corfu-server_4.1.20230509211150.7973.1_all.deb
        (Reading database ... 61995 files and directories currently installed.)
        Preparing to unpack nsx-corfu-server_4.1.20230509211150.7973.1_all.deb ...
        Synchronizing state of corfu-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
        Executing: /lib/systemd/systemd-sysv-install disable corfu-server
        Removed /etc/systemd/system/nsx-custom.target.wants/corfu-server.service.
        Synchronizing state of corfu-nonconfig-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
        Executing: /lib/systemd/systemd-sysv-install disable corfu-nonconfig-server
        Removed /etc/systemd/system/nsx-custom.target.wants/corfu-nonconfig-server.service.
        Software Integrity Check is not Enabled
        Unpacking nsx-corfu-server (4.1.20230509211150.7973.1) over (4.1.20230215040838.1096.1) ...
        Setting up nsx-corfu-server (4.1.20230509211150.7973.1) ...
        Synchronizing state of corfu-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
        Executing: /lib/systemd/systemd-sysv-install enable corfu-server
        Created symlink /etc/systemd/system/nsx-custom.target.wants/corfu-server.service -> /lib/systemd/system/corfu-server.service.
        Synchronizing state of corfu-nonconfig-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
        Executing: /lib/systemd/systemd-sysv-install enable corfu-nonconfig-server
        Created symlink /etc/systemd/system/nsx-custom.target.wants/corfu-nonconfig-server.service -> /lib/systemd/system/corfu-nonconfig-server.service.
        Synchronizing state of corfu-log-replication-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
        Executing: /lib/systemd/systemd-sysv-install enable corfu-log-replication-server
        Created symlink /etc/systemd/system/nsx-custom.target.wants/corfu-log-replication-server.service -> /lib/systemd/system/corfu-log-replication-server.service.
        Starting corfu-server
        Starting corfu-nonconfig-server
        Processing triggers for systemd (245.4-4ubuntu3.13) ...

Step 4. Validate the Tanuki version. The most recent entry must show Version (a106826), which verifies the installation

        root@nsxmanager:~# grep Version /var/log/corfu/tanuki.log
        INFO   | jvm 1    | 2023/05/23 11:28:51 | Version (3fb572f)
        INFO   | jvm 1    | 2023/05/23 17:02:02 | Version (a106826)          <<<< a106826 is the expected version
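
Because tanuki.log accumulates one Version line per start, only the last entry matters. The following is a minimal sketch that extracts the most recent version string and compares it with the expected value from Step 4. It assumes the "Version (...)" format shown above; latest_version is an illustrative name.

```python
import re

# Version string expected after the package from Step 3 is installed.
EXPECTED_VERSION = "a106826"
VERSION_RE = re.compile(r"Version \(([^)]+)\)")


def latest_version(lines):
    """Return the version from the last 'Version (...)' line, or None."""
    latest = None
    for line in lines:
        m = VERSION_RE.search(line)
        if m:
            latest = m.group(1)
    return latest


def workaround_installed(lines, expected=EXPECTED_VERSION):
    """Return True when the most recent logged version is the expected one."""
    return latest_version(lines) == expected
```

Passing the lines of /var/log/corfu/tanuki.log to workaround_installed gives a single pass or fail result for the Manager being checked.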
        
Step 5. Wait for the Manager cluster to stabilize, then repeat Steps 2, 3, and 4 on the other two Managers

    root@nsxmanager:~# su admin -c "get cluster status"
    Tue May 23 2023 UTC 17:09:20.067
    Cluster Id: ef688269-9d2e-4345-ba51-203feff85e46
    Overall Status: STABLE

    Group Type: DATASTORE
    Group Status: STABLE

    Members:
        UUID                                       FQDN                                       IP        IPv6                                    STATUS
        7e4f2842-xxxx-xxxx-xxxx-3488xxxxxx      nsxm-xxxxx                             X.X.X.1        -                                       UP   
        8c4f3842-xxxx-xxxx-xxxx-3488xxxxxx      nsxm-xxxxx                             X.X.X.2        -                                       UP   
        9c5c1234-xxxx-xxxx-xxxx-3488xxxxxx      nsxm-xxxxx                             X.X.X.3        -                                       UP
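
When the stability check between Managers is scripted, the text output of get cluster status can be inspected as sketched below. This is a minimal sketch, assuming the "Overall Status:" and "Group Status:" lines appear as in the example above; cluster_is_stable is an illustrative name and does not inspect the per-member STATUS column.

```python
import re


def cluster_is_stable(status_text):
    """Return True when the overall status and every group status is STABLE."""
    overall = re.search(r"^\s*Overall Status:\s*(\S+)", status_text, re.MULTILINE)
    if overall is None or overall.group(1) != "STABLE":
        return False
    groups = re.findall(r"^\s*Group Status:\s*(\S+)", status_text, re.MULTILINE)
    # Require at least one group line so that truncated output is not
    # mistaken for a healthy cluster.
    return bool(groups) and all(status == "STABLE" for status in groups)
```

The workaround should be applied on the next Manager only after this check returns True.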

 

** NOTE: The workaround must be reapplied if a Manager node is redeployed while the environment is still on 4.1.0.


Additional Information

Impact/Risks:
Some NSX configurations may get deleted.