At times certain controller APIs could fail due to cleanup of API server reference files

Products

VMware NSX Networking

Issue/Introduction

Symptoms:

NSX Manager GUI displays an NSX Controller node as Disconnected.

In the NSX Manager logs, you see entries similar to:

ERROR http-nio-127.0.0.1-7441-exec-5 BaseRestController:452 - REST API failed : 'I/O error on POST request for "https://x.x.x.x/ws.v1/login": Remote host closed connection during handshake; nested exception is javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake'
org.springframework.web.client.ResourceAccessException: I/O error on POST request for "https://x.x.x.x/ws.v1/login": Remote host closed connection during handshake; nested exception is javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake

Traceflow fails, and the User Interface displays a warning similar to:

Controller: X.X.X.X communication error, details: I/O error on POST request for "https://x.x.x.x/ws.v1/login": Remote host closed connection during handshake.

Hardware Gateway status shows as 'Down' in vCenter UI.
Central CLI commands that query NSX Controller information fail with an error similar to:

nsx-mgr> show logical-switch controller <controller-id> vni <vni-id> brief

Error: 100: I/O error on POST request for "https://controller-ip-address/ws.v1/login": Connection reset; nested exception is java.net.SocketException: Connection reset.

Note: One or more of the above symptoms may be seen. Each failure produces a corresponding error entry in the NSX Manager logs.
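Because every failure leaves an entry in the NSX Manager logs, a quick count of matching lines can help confirm the symptom. The following is a minimal sketch, not a VMware-provided tool: the function name count_controller_api_errors is hypothetical, and the log file is passed as an argument because this article does not state where the logs are stored.

```shell
#!/bin/sh
# Count log lines that match the failure signature quoted in the symptoms.
# $1 is the path to an exported NSX Manager log file (location not assumed here).
count_controller_api_errors() {
    log_file="$1"
    # First keep only failed POSTs to the controller login endpoint, then
    # count those caused by a TLS handshake failure or a connection reset.
    grep -E 'I/O error on POST request for .*/ws\.v1/login' "$log_file" \
        | grep -c -E 'SSLHandshakeException|Connection reset'
}
```

A result of 0 means no line in the file matched both patterns; it does not by itself rule out the issue if the log has rotated.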

Environment

VMware NSX for vSphere 6.3.x

Cause

The NSX for vSphere 6.3.3 controller node has a periodic clean-up task that deletes a status file required by the API server if the API server has sufficiently low activity. After the file is deleted, any new connections from the NSX Manager to the API server will fail.

NSX Manager monitors and updates the controller cluster using REST API calls to the controller-cluster members.

NSX Manager maintains a persistent connection to each controller's API server for this purpose. Until those connections are disrupted (for example, by physical network issues or a restart of NSX Manager), NSX Manager retains access to the controllers for cluster monitoring and for creating and modifying NSX logical switches and routers. Only operations that do not use these persistent connections, such as Traceflow and Central CLI, fail. If an external event disrupts the persistent TCP connections, NSX Manager loses the ability to make API connections to the controllers.

Note: The controller API server is used only for management plane access between the controller and NSX Manager. Disruptions to the API server do not affect controller-cluster operations or the control plane and data plane states of NSX. Due to the fault-tolerant design of the distributed controller cluster, NSX Manager can continue to update the entire controller cluster as long as it has API connectivity to at least one of the controller nodes.
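The difference between existing persistent connections and new ones can be illustrated by opening a fresh HTTPS connection to a controller and looking at how it ends. This is a sketch under assumptions: the function names are hypothetical, it is meant to be run from a host that can reach the controller, and it sends no credentials, so it tests connection establishment only. The exit status meanings are taken from the curl manual; on an affected controller a handshake failure (35) or a reset (56) would be the likely result, though that mapping is an inference from the error text above and not something VMware documents.

```shell
#!/bin/sh
# Translate a curl exit status into a short description.
describe_curl_status() {
    case "$1" in
        0)  echo "connected" ;;                          # TLS handshake and HTTP exchange completed
        7)  echo "tcp connect failed" ;;                 # could not connect to host
        28) echo "timed out" ;;                          # operation timeout
        35) echo "tls handshake failed" ;;               # SSL connect error
        56) echo "connection reset while receiving" ;;   # failure receiving network data
        *)  echo "other error ($1)" ;;
    esac
}

# Open a brand-new HTTPS connection to the controller API server and
# report how it ended. $1 is the controller IP address.
probe_controller_api() {
    controller_ip="$1"
    curl -k -s -o /dev/null --max-time 10 "https://${controller_ip}/ws.v1/login"
    describe_curl_status "$?"
}
```

Note that "connected" only means the server answered over TLS; any HTTP status, including an authentication error, counts as a completed connection here.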

Resolution

This issue is resolved in VMware NSX for vSphere 6.3.4, available at VMware Downloads.

Note: The password expiration does not affect the Hardware Gateway itself, but it does affect status reporting, because NSX Manager cannot communicate with the controller to retrieve the correct status. To fix this, apply the workaround below, which also addresses the password expiration.

To work around the issue on NSX for vSphere 6.3.3, and to avoid encountering this issue while upgrading from NSX for vSphere 6.3.3 to a later version, VMware developed a signed script that recreates the status file required by the API server.

The workaround requires two signed scripts to be executed sequentially using REST API calls to NSX Manager.

Download the attached signed_bsh_download_jar.encoded and signed_bsh_passwd_expiry_napi.encoded files.

Notes:

The same scripts are also mentioned in the workaround section of KB article Deploying NSX Controller fails in NSX-v 6.3.3 and 6.3.4 (51144). Running the scripts applies the same workaround for both KB articles.
While performing the below steps, the data path will not be impacted; however, the Controller status will be in the disconnected state for a brief period.

Run the following POST calls on NSX Manager:

Confirm IP connectivity from NSX Manager to all the controllers using ping.
Proceed only after the IP connectivity is established.
Method: POST
URL: https://nsxmgr_ip/api/1.0/services/debug/script
Authentication: Basic authentication (Username : admin)
Headers: content-type - application/xml
Body : copy contents of the attached file signed_bsh_download_jar.encoded.
Expected Response: 200

Note: When pasting the contents into the body, make sure that no extra lines or characters are added at the end; otherwise the API call may not run successfully. Proceed to step-3 only if the response is 200.

File a support request with VMware support if the API call fails after multiple attempts.

Alternatively, you can use cURL:

curl -k -X POST -H "Content-Type: application/xml" -d "@signed_bsh_download_jar.encoded" -u user:password https://nsxmgr_ip/api/1.0/services/debug/script

Method: POST
URL: https://nsxmgr_ip/api/1.0/services/debug/script
Authentication: Basic authentication (Username : admin)
Headers: content-type - application/xml
Body : copy contents of the attached file signed_bsh_passwd_expiry_napi.encoded.
Expected Response: 200

As part of Step 3, the API service on each Controller is restarted sequentially. After a successful restart, the Controller status should show as Connected in the NSX Manager Installation UI.

Note: If any or all of the Controllers are re-deployed, repeat the preceding steps.

Alternatively, you can use cURL:

curl -k -X POST -H "Content-Type: application/xml" -d "@signed_bsh_passwd_expiry_napi.encoded" -u user:password https://nsxmgr_ip/api/1.0/services/debug/script
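
The two cURL calls above can be chained so that the second script is sent only when the first returns 200. The following is a minimal sketch, not a VMware-provided script: the function names and the NSX_MGR, NSX_USER, and NSX_PASS environment variables are assumptions introduced for illustration. It expects both .encoded files in the current directory and does not replace the ping check, which is still performed from NSX Manager beforehand.

```shell
#!/bin/sh
# Hypothetical wrapper around the two documented POST calls.
# Assumed environment variables: NSX_MGR (NSX Manager address),
# NSX_USER and NSX_PASS (Basic authentication credentials).

# Succeed only when the HTTP status is exactly 200.
is_success() {
    [ "$1" = "200" ]
}

# POST one signed script to NSX Manager and print only the HTTP status code.
post_script() {
    payload="$1"
    curl -k -s -o /dev/null -w '%{http_code}' -X POST \
        -H "Content-Type: application/xml" \
        -d "@${payload}" \
        -u "${NSX_USER}:${NSX_PASS}" \
        "https://${NSX_MGR}/api/1.0/services/debug/script"
}

# Send both scripts in the documented order; stop at the first non-200 response.
run_workaround() {
    for script_file in signed_bsh_download_jar.encoded signed_bsh_passwd_expiry_napi.encoded; do
        status=$(post_script "$script_file")
        if ! is_success "$status"; then
            echo "POST of ${script_file} returned ${status}; stopping" >&2
            return 1
        fi
        echo "POST of ${script_file} returned 200"
    done
}
```

After exporting the three variables, call run_workaround from the directory that contains the two files. When curl reads a body with -d "@file", it strips carriage returns and newlines from the file, so a trailing newline in the downloaded file is not sent; other stray characters still are.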

Attachments

signed_bsh_passwd_expiry_napi.encoded

signed_bsh_download_jar.encoded