Extra read rate on VMFS datastores after upgrading to NSX-T 3.2.0.1

Article ID: 317742


Products

VMware NSX Networking

Issue/Introduction

Symptoms:
NSX version 3.2.0
All ESXi hosts see a read rate increase of approximately 2 MB/s to VMFS datastores.
The problem occurs even when hosts are in maintenance mode or have no VMs running on them.

Cause

If nestdb crashes because of a disk-full event, all functions that depend on nestdb may be impacted. The "nestdb_remedy" plugin was introduced to monitor disk usage and restart the nestdb service once disk space becomes available.
By default, nestdb_remedy performs this check every 20 seconds, which increases the disk read rate.

Resolution

This issue is resolved in NSX-T 3.2.2.
The issue also does not occur in NSX 4.0, as the nestdb_remedy plugin is not used in that release.

Workaround:
Below are example steps to change the check interval to 120 seconds.
 
Step 1. Create an NSGroup via the API or UI containing the desired host Transport Nodes

A) UI option: in the NSX UI, select the Manager view and navigate to Inventory > Groups. Create an NSGroup and add the desired Transport Nodes.
The NSGroup UUID needed in Step 3 is listed on the Overview tab.

B) API option: Host Transport Node UUIDs can be found in the UI (System > Fabric > Nodes > Host Transport Nodes), or from the get nodes output in the Manager CLI; a scripted lookup is sketched below.
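
The Transport Node IDs can also be retrieved via the Manager API (GET /api/v1/transport-nodes). The following is a minimal sketch using Python's requests library; the manager address and credentials are placeholders for your environment, and certificate verification is disabled here only for brevity:

# Sketch: list transport node IDs and names via the Manager API.
import requests

MANAGER = "https://{{manager_ip}}"   # substitute your NSX Manager address
AUTH = ("admin", "password")         # substitute real credentials

resp = requests.get(f"{MANAGER}/api/v1/transport-nodes", auth=AUTH, verify=False)
resp.raise_for_status()
for node in resp.json()["results"]:
    print(node["id"], node.get("display_name"))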

 
API to create NSGroup:
POST https://{{manager_ip}}/api/v1/ns-groups

 
{
    "display_name":"219NodeGroup",
    "members" : [ {
      "resource_type" : "NSGroupSimpleExpression",
      "target_type" : "TransportNode",
      "target_property" : "id",
      "op" : "EQUALS",
      "value" : "42492eaf-0fbe-4b2c-bf84-19b0b1c9913e"       <----- target host 1
    }, {
      "resource_type" : "NSGroupSimpleExpression",
      "target_type" : "TransportNode",
      "target_property" : "id",
      "op" : "EQUALS",
      "value" : "0d558e7a-33b1-421d-a35b-0598ab3f354b"       <----- target host 2
    } ]
}

 
Response:
 
{
    "members": [
...
    ],
    "member_count": 2,
    "resource_type": "NSGroup",
    "id": "b079681f-9b27-4b1d-bc5a-da0e4906f5fd",     <----- NSGroup ID
    "display_name": "219NodeGroup",
    "_create_time": 1648532005403,
    "_create_user": "admin",
    "_last_modified_time": 1648532005403,
    "_last_modified_user": "admin",
    "_system_owned": false,
    "_protection": "NOT_PROTECTED",
    "_revision": 0
}
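
The same request can be scripted end to end. A minimal sketch with Python's requests library, reusing the example host UUIDs above (manager address and credentials are placeholders):

# Sketch: create the NSGroup from Step 1 and capture its ID for Step 3.
import requests

MANAGER = "https://{{manager_ip}}"   # substitute your NSX Manager address
AUTH = ("admin", "password")         # substitute real credentials

body = {
    "display_name": "219NodeGroup",
    "members": [
        {
            "resource_type": "NSGroupSimpleExpression",
            "target_type": "TransportNode",
            "target_property": "id",
            "op": "EQUALS",
            "value": "42492eaf-0fbe-4b2c-bf84-19b0b1c9913e",  # target host 1
        },
        {
            "resource_type": "NSGroupSimpleExpression",
            "target_type": "TransportNode",
            "target_property": "id",
            "op": "EQUALS",
            "value": "0d558e7a-33b1-421d-a35b-0598ab3f354b",  # target host 2
        },
    ],
}
resp = requests.post(f"{MANAGER}/api/v1/ns-groups", json=body, auth=AUTH, verify=False)
resp.raise_for_status()
print(resp.json()["id"])   # NSGroup ID needed in Step 3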
 

Step 2. Create a plugin profile with CHECK_INTERVAL: 120
 
POST https://{{manager_ip}}/api/v1/systemhealth/profiles/
 
{
    "display_name": "nestdb-remedy-control-profile-1",
    "enabled": true,
    "config": "{\"CHECK_INTERVAL\": 120, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}",      <----- set config like this
    "plugin_id":"08878948-f2ae-42b6-8c63-c03091cac158"       <----- use this UUID
}
 
 
Response:
{
    "type": "NETWORK",
    "enabled": true,
    "config": "{\"CHECK_INTERVAL\": 120, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}",
    "plugin_id": "08878948-f2ae-42b6-8c63-c03091cac158",
    "resource_type": "SystemHealthAgentProfile",
    "id": "440fe269-f5e0-499d-b7ea-f21f68386835",       <----- profile ID
    "display_name": "nestdb-remedy-control-profile-1",
    "_create_time": 1648540461728,
    "_create_user": "admin",
    "_last_modified_time": 1648540461728,
    "_last_modified_user": "admin",
    "_system_owned": false,
    "_protection": "NOT_PROTECTED",
    "_revision": 0
}
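
Scripted, Step 2 looks like the sketch below (Python requests; placeholders as before). Note that the "config" field is a JSON-encoded string, not a nested object:

# Sketch: create the nestdb_remedy control profile and capture its ID.
import json
import requests

MANAGER = "https://{{manager_ip}}"   # substitute your NSX Manager address
AUTH = ("admin", "password")         # substitute real credentials

body = {
    "display_name": "nestdb-remedy-control-profile-1",
    "enabled": True,
    # "config" must be a JSON string, matching the request body shown above.
    "config": json.dumps({
        "CHECK_INTERVAL": 120,
        "MAX_TRY_COUNT_FOR_A_CRASH": 2,
        "MIN_INTERVAL_BETWEEN_TWO_REMEDIATION": 300,
    }),
    "plugin_id": "08878948-f2ae-42b6-8c63-c03091cac158",
}
resp = requests.post(f"{MANAGER}/api/v1/systemhealth/profiles/",
                     json=body, auth=AUTH, verify=False)
resp.raise_for_status()
print(resp.json()["id"])   # profile ID needed in Step 3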
 
 
Step 3. Apply the created profile to the NSGroup
 
POST https://{{manager_ip}}/api/v1/service-configs
 
{
 "display_name":"nestdb-control-service-config-1",
 "profiles":[{
  "profile_type":"SHAProfile",
  "target_id":"440fe269-f5e0-499d-b7ea-f21f68386835"}],       <----- profile ID
 "applied_to":[{
   "target_id":"b079681f-9b27-4b1d-bc5a-da0e4906f5fd",        <----- NSGroup ID
   "target_type":"NSGroup"
 }]
}
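
The equivalent scripted call (Python requests; substitute the profile ID from Step 2 and the NSGroup ID from Step 1):

# Sketch: bind the control profile to the NSGroup via a service config.
import requests

MANAGER = "https://{{manager_ip}}"   # substitute your NSX Manager address
AUTH = ("admin", "password")         # substitute real credentials

body = {
    "display_name": "nestdb-control-service-config-1",
    "profiles": [{
        "profile_type": "SHAProfile",
        "target_id": "440fe269-f5e0-499d-b7ea-f21f68386835",   # profile ID from Step 2
    }],
    "applied_to": [{
        "target_id": "b079681f-9b27-4b1d-bc5a-da0e4906f5fd",   # NSGroup ID from Step 1
        "target_type": "NSGroup",
    }],
}
resp = requests.post(f"{MANAGER}/api/v1/service-configs",
                     json=body, auth=AUTH, verify=False)
resp.raise_for_status()
print(resp.json()["id"])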
 
 
Step 4. Confirm the new CHECK_INTERVAL is effective for the "nestdb_remedy" plugin on an ESXi host (example host UUID: 42492eaf-0fbe-4b2c-bf84-19b0b1c9913e)
 
GET https://{{manager_ip}}/api/v1/systemhealth/plugins/status/42492eaf-0fbe-4b2c-bf84-19b0b1c9913e
 
Response:

[
...
{
            "id": "08878948-f2ae-42b6-8c63-c03091cac158",
            "name": "nestdb_remedy",
            "status": "NORMAL",
            "profile": "NAME: nestdb-remedy-control-profile-1, ENABLE: True, CHECK_INTERVAL: 120, MAX_TRY_COUNT_FOR_A_CRASH: 2, MIN_INTERVAL_BETWEEN_TWO_REMEDIATION: 300",
            "detail": ""
        },
...
]
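
The verification can also be scripted. A minimal sketch (Python requests; placeholders as before) that assumes the array-shaped response shown above:

# Sketch: check that nestdb_remedy reports the new CHECK_INTERVAL.
import requests

MANAGER = "https://{{manager_ip}}"   # substitute your NSX Manager address
AUTH = ("admin", "password")         # substitute real credentials
HOST_UUID = "42492eaf-0fbe-4b2c-bf84-19b0b1c9913e"   # example host from Step 4

resp = requests.get(f"{MANAGER}/api/v1/systemhealth/plugins/status/{HOST_UUID}",
                    auth=AUTH, verify=False)
resp.raise_for_status()
for plugin in resp.json():
    if plugin["name"] == "nestdb_remedy":
        print(plugin["status"], plugin["profile"])   # expect CHECK_INTERVAL: 120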
 
Note: This workaround has no negative impact. The only consequence is that, in the event of a disk usage issue, nestdb_remedy will take longer to restart the nestdb service.

In rare cases the issue may persist even after following the steps above. To resolve it, restart proton on all three Managers from the admin shell:
> restart service manager

Additional Information

Impact/Risks:
Elevated read data volume from every datastore on all ESXi hosts configured with NSX-T.
In environments with many ESXi hosts, the combined effect on storage can be significant.