Ports connected to N-VDS show in a Blocked State
search cancel

Ports connected to N-VDS show in a Blocked State

book

Article ID: 321128

calendar_today

Updated On:

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
-- Vmkernel Interfaces such as Management, vMotion connected to N-VDS, go into blocking state causing host to show disconnected in vSphere Client and the hosts are unmanageable
-- In the net-dvs output on ESXi host, you see that the link is in a blocked state with message "Port blocked by admin"

From the ESXi, you can run the following to verify a port in this particular link state:

#net-dvs -l | grep -E "port |port.block|volatile.vlan|volatile.status"
port c2602212-cec6-4783-865a-2e9ae8fe6a71:
                com.vmware.common.port.block = true , propType = POLICY
                com.vmware.common.port.volatile.vlan = VLAN 0
                com.vmware.common.port.volatile.status = inUse linkUp blocked portID=67108871 Port blocked by admin propType = RUNTIME

Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 2.x

Cause

This issue occurs if a Transport Node loses connectivity to its master controller node. 

Every Transport Node connects to the Central Control Plane via a Logical Control Plane (LCP) daemon. The CCP pushes the Logical Switch, Router and DFW information to the transport node via this LCP channel. 

In the event where the Transport Node loses connectivity to the controller, the controller will start a 24 hour countdown. After 24 hours, if the communication between the transport node and the controller is still down, the controller will purge the Transport Node information from its database.

Now if the host/controller communication is restored after 24 hours have expired, the controllers push an empty dataset to the host, resulting in all the ports on the ESXi host connected to N-VDS to go in a "Port Blocked state". This limitation was removed in VMware NSX-T Data Center 2.4.1 and later releases.

Resolution

This behavior is by design, to keep accurate inventory for the virtual network. VMware recommends monitoring of the communication channel between controllers and Transport Nodes so that a disruption of this nature can be addressed prior to the 24 hour timeout.

Workaround:
Workaround to be implemented to recover the environment after the issue has occurred

1. Identify all the ports that are in a Blocked state, by running the below command
net-dvs -l | grep -E "port |port.block|volatile.vlan|volatile.status"
2. Identify the port IDs for all connected interfaces on the ESXi host
net-stats -l
3. Match the output of 2 and 3 to see what ports correspond to which vNic interfaces
4. Identify the N-VDS switch name on the host
esxcfg-vswitch -l
5. Run below command to unblock the ports which are in blocked state
net-dvs -s com.vmware.common.port.block=false <N-VDS-NameFromStep4> -p <LogicalSwitchPor UUIDfromStep1>
Example:
net-dvs -s com.vmware.common.port.block=false N-VDS-hostswitch -p c2602212-cec6-4783-865a-2e9ae8fe6a71
6. In a rare event, the VLAN of the port maybe lost as well. To re-add the VLAN back, run:
net-dvs -v <VLAN-ID> -p <LogicalSwitchPortUUIDfromStep1> <N-VDS-NameFromStep4>
Example:
net-dvs -v 2081 -p c2602212-cec6-4783-865a-2e9ae8fe6a71 N-VDS-hostswitch

--------------------------------

Workaround to be implemented to increase the timeout from 24 hours (1 day) to 100 days.

Note: 
- This will delay the host disconnect issue by 100 days, if the host reconnects within 100 days the issue will not be seen.
- The workaround will NOT be useful for the hosts which have already been disconnected and CCP has expired the host data.
             
1. Login as root on the controllers.
2. vi /opt/vmware/ccp/bin/java/vnetcontroller/core.properties
3. Search for the following line in the core.properties file
            “connection.expire.period=<expiry-time-in-minutes>”
4. If this parameter is not present in core.properties, add the following line to increase timeout from 1 day to 100 days. The default value is 1440 minutes or 24 hours or 1 day
              “connection.expire.period=144000”
5. If this parameter is present in core.properties, modify the timeout value from 1440 to 14400
6. Restart the controller service (login as admin and run “restart service controller”)


Additional Information

-- Looking at the /var/log/nsxproxy.log in the ESXi host, log messages similar to below will be seen, which show the host lost connectivity to its master controller
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="69018" level="INFO"] StreamConnection[80 Closing to ssl://10.208.214.45:1235 sid:80] Closing (reason: Remotely)
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="69018" level="INFO"] StreamConnection[80 Closed to ssl://10.208.214.45:1235 sid:-1] Closed (reason: Remotely, error: 2-End of file)
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] RpcConnection[80 Connected to ssl://10.208.214.45:1235] Closing (network error)
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] RpcTransport[2] Unregistering remote endpoint ids '918c4331-98cd-4945-8322-ae92390a26f1' @ RpcConnection[80 Closed to ssl://10.208.214.45:1235]
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] RpcTransport[2]::RemoteServiceManager Unregistering remote service 'vmware.nsx.h
andshake.VersionMastershipHandshake' @ RpcConnection[80 Closed to ssl://10.208.214.45:1235]
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] RpcConnection[80 Closed to ssl://10.208.214.45:1235] Notifying channels on connection down (network error)
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69029" level="ERROR" errorCode="RPC101"] RpcCall:Client:ServerStreaming[vmware.nsx.handshake.VersionMastershipHandshake/GetMaster, 0x0001, LOCAL_DONE] Channel is down
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] ConnectionKeeper[15 ssl://10.208.214.45:1235] resetting connection, will reconnect
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69029" level="ERROR" errorCode="RPC102"] RpcCall:Client:ServerStreaming[vmware.nsx.handshake.VersionMastershipHandshake/GetMaster, 0x0001, LOCAL_DONE] Is in error state (COMMUNICATION_ERROR, status is not reported to the Server)
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] ConnectionKeeper[15 ssl://10.208.214.45:1235] closing and releasing connection cid:80
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: received MasterResponse CallStatus COMMUNICATION_ERROR
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] RpcConnection[80 Closed to ssl://10.208.214.45:1235] Closing (closed by user)
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="WARN"] RpcConnection[80 Closed to ssl://10.208.214.45:1235] is already closed
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: connection down, detected

-- In the same log file, the Transport Node disconnected from its controller. Note that the IP address of the controller is in decimal format i.e {   ipv4: 181458477 } which translates to 10.208.214.45 in dotted decimal format.
2018-12-01T05:23:32Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] Write ccp session message to nestdb ccp_id {   left: 10487831512535091525   right: 94493069112221
46801 } ip {   ipv4: 181458477 } server_port: 1235 fqdn: "" state: DISCONNECTED master: true

-- 24 hours after the Transport Node disconnected from the controller, the controller deletes the Transport Node data from its local database.
2018-12-02 05:23:51.729 UTC INFO  [pool-16-thread-1] [c.v.n.c.r.LcpReplicatorImpl] - [nsx@6876 comp="nsx-controller" subcomp="replication"] TN data was cleared (expired, removed or inMaintenance): 48051aaf-6cec-478b-a2c7-34a630743105

-- After several days, the transport node and controller communication is re-established
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="69018" level="INFO"] StreamConnection[24262 Connected to ssl://10.208.214.45:1235 sid:24262] Connected from ssl-tcp://10.208.214.142:47322 to server with certificate with sha256 fingerprint '4aa22f69b93189889f74abf441c07c26f010ad1ad1275c6cc52a97f9b268f187'
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] ConnectionKeeper[3469 ssl://10.208.214.45:1235] established connection cid:24262
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: connected {net: {endpoint: ssl://10.208.214.45:1235, local-ep:
ssl-tcp://10.208.214.142:47322, peer-ep: ssl://10.208.214.45:1235, peer-auth: {certificate sha256 '4aa22f69b93189889f74abf441c07c26f010ad1ad1275c6cc52a97f9b268f187'}}}
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] RpcTransport[2] Registering remote endpoint ids '918c4331-98cd-4945-8322-ae92390a26f1' @ RpcConnection[24262 Connected to ssl://10.208.214.45:1235]
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="69018" level="INFO"] RpcTransport[2]::RemoteServiceManager Registering remote service 'vmware.nsx.handshake.VersionMastershipHandshake' @ RpcConnection[24262 Connected to ssl://10.208.214.45:1235]
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: Connected
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: created stub successfully
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: version rpc sending transport_node {   left: 5189583486363387787   right: 11729401642891882757 } supported_version {   major: 2   minor: 3 } supported_version {   major: 2   minor: 1 }
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: received VersionResponse supported_version {   major: 2   minor: 3 }
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="WARN"] ChannelDetection new channel reconnect
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: mastership rpc query sending transport_node {   left: 5189583486363387787   right: 11729401642891882757 }
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: received  CallStatus SUCCESS
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] VersionMastershipHandshakeClient: received MasterResponse { master_node {   left: 10487831512535091525   right: 9449306911222146801 } }

-- Transport Node connects to its master controller
2018-12-07T01:57:09Z nsx-proxy[69012]: [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="69029" level="INFO"] Write ccp session message to nestdb ccp_id {   left: 10487831512535091525   right: 9449306911222146801 } ip {   ipv4: 181458477 } server_port: 1235 fqdn: "" state: CONNECTED master: true

-- In Controller nsx-ccp logs, see the communication channel re-establishing
2018-12-07 01:56:38.813 UTC INFO  [nsx-rpc:CCP-918c4331-98cd-4945-8322-ae92390a26f1:user-executor-0] [c.v.n.h.VersionMastershipServiceImpl] - [nsx@6876 comp="nsx-controller" subcomp="handshake-server"] Versi
on handshake succeeded for 48051aaf-6cec-478b-a2c7-34a630743105
2018-12-07 01:56:38.815 UTC INFO  [nsx-rpc:CCP-918c4331-98cd-4945-8322-ae92390a26f1:user-executor-3] [c.v.n.h.VersionMastershipServiceImpl] - [nsx@6876 comp="nsx-controller" subcomp="handshake-server"] Open
mastership for 48051aaf-6cec-478b-a2c7-34a630743105, controller is myself 918c4331-98cd-4945-8322-ae92390a26f1
2018-12-07 01:56:38.815 UTC INFO  [nsx-rpc:CCP-918c4331-98cd-4945-8322-ae92390a26f1:user-executor-3] [c.v.n.r.c.NsxRpcChannel] - [nsx@6876 comp="nsx-controller" subcomp="ccp"] ClientChannel(b550f158-9659-4c8
e-8707-cff279fce1e9) doOpenStream OutboundMessage(streamId=null payloadSize=0 streamControl=service: "vmware.nsx.nestdb.NestDb"open_stream: truepriority: 64 frame=)

 -- An empty dataset is pushed to the Transport Node by its master controller
2018-12-08 01:58:51.734 UTC INFO  [Curator-PathChildrenCache-0] [c.v.n.c.r.LcpReplicatorImpl] - [nsx@6876 comp="nsx-controller" subcomp="replication"] Data set becomes empty, transportNodeId=[48051aaf-6cec-478b-a2c7-34a630743105], process=[Store update].

-- In the host vmkernel.log file, you would see messages such as below, indicating that the port is in a blocked state
2018-12-07T01:57:10.790Z cpu1:66668)NetPort: 3203: blocking traffic on DV port 86762494-05a4-40c2-9da3-6419b5fa1b62
2018-12-07T01:57:10.790Z cpu1:66668)kcp: KCP_DeletePort:459: [nsx@6876 comp="nsx-esx" subcomp="kcp"]Port 67108875 is cleared and blocked
--
2018-12-07T01:57:10.791Z cpu1:66668)NetPort: 3203: blocking traffic on DV port d331e025-0983-4825-8693-24279742c6ce
2018-12-07T01:57:10.791Z cpu1:66668)kcp: KCP_DeletePort:459: [nsx@6876 comp="nsx-esx" subcomp="kcp"]Port 67108870 is cleared and blocked
--
2018-12-07T01:57:10.792Z cpu1:66668)NetPort: 3203: blocking traffic on DV port 53588a39-fabc-4bce-86b1-29477b28a83c
2018-12-07T01:57:10.792Z cpu1:66668)kcp: KCP_DeletePort:459: [nsx@6876 comp="nsx-esx" subcomp="kcp"]Port 67108871 is cleared and blocked
--