Network connectivity issues after upgrade in NSX/VCNS environment

Article ID: 344127

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
In VMware NSX for vSphere 6.x or VMware vCloud Networking and Security (VCNS) 5.x, after upgrading the cluster or during an ESXi host unprepare/reprepare task, you experience these symptoms:
  • NSX upgrade fails
  • Connecting virtual machines to a logical switch fails
  • Re-connecting a virtual machine to a logical switch fails
  • You see the error:

    Failed to connect virtual device ethernet0

  • In the /var/log/vmkernel.log file on the ESXi host, you see entries similar to:

    <YYYY-MM-DD>T<time>.833Z cpu16:1138415)WARNING: vxlan: VDL2PortPropSet:672: Failed to set control plane property for port[0x4000016] on VDS[DvsPortset-0] : Would block
    <YYYY-MM-DD>T<time>.833Z cpu16:1138415)WARNING: NetPort: 1431: failed to enable port 0x4000016: Would block

    Note: The Would block string is the most definitive symptom of this issue. The Would block error can also appear when a misconfigured Host Profile is used; VMware does not recommend using Host Profiles to create the VXLAN Tunnel End Point (VTEP). That scenario is different from the issue described in this article. For more information, see the Updating the host profiles section in Deploying VXLAN through Auto Deploy and VMware NSX for vSphere 6.x (2092871).
  • The VXLAN Tunnel End Point (VTEP) is not created on the ESXi host.
  • Running the command esxcli network vswitch dvs vmware vxlan list displays no output.
  • In the vSphere Client or the vSphere Web Client on the vSphere Distributed Switch (vDS), you see entries similar to:

    com.vmware.net.overlay.class.missing
    com.vmware.vxlan.instance.notexist
  • In the /var/log/vmkernel.log file on the ESXi host, you see entries similar to:

    <YYYY-MM-DD>T<time>.480Z cpu24:33371)Net: 221: Created new Tcpip Instance at 0x410a716b2fc0, index 1, name: vxlan, ccalgo: newreno, socketMax: 11000
    <YYYY-MM-DD>T<time>.492Z cpu24:33371)NetOverlay: 1218: class name:vxlan not found
    <YYYY-MM-DD>T<time>.492Z cpu24:33371)NetOverlay: 1326: failed to install overlay instance vxlan: Not found
    <YYYY-MM-DD>T<time>.492Z cpu24:33371)NetOverlay: 1496: Instantiate overlay class:[vxlan] on vds:[DvsPortset-0] failed
    <YYYY-MM-DD>T<time>.492Z cpu24:33371)WARNING: vdrb: VdrCreateVxlanTrunk:59: CNXN:[C:dvSwitch-xxxx,P:33554446] Failed to Create VXLAN trunk status: Would block
  • On a vCloud Director environment, deployment of the Edge fails and you see entries similar to:

    Cannot deploy organization VDC network (5b81120a-b0b2-4c54-8861-80633e5dad5c)
    Failed to connect interface of edge gateway edge_gateway_name to organization VDC network VDC_network_name
    java.util.concurrent.ExecutionException: com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (10014): Configuration failed on NSX Edge vm vm-1922. ([65280] /sbin/ifconfig vNic_1 up failed :

    SIOCSIFFLAGS: Invalid argument

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
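
A quick way to confirm this symptom on a suspect host is to search the vmkernel log for the Would block string and to check whether any VXLAN configuration exists, using the log file and commands shown above:

# Search the ESXi vmkernel log for the definitive "Would block" symptom
grep -i "would block" /var/log/vmkernel.log
# On an affected host, this command returns no output
esxcli network vswitch dvs vmware vxlan list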


Environment

VMware vCloud Networking and Security 5.5.x
VMware NSX for vSphere 6.1.x
VMware NSX for vSphere 6.0.x
VMware vCloud Networking and Security 5.1.x

Cause

When a prepared ESXi host boots, the vCenter Server pushes the vSphere Distributed Switch (vDS) configuration properties to the ESXi host. The NSX VXLAN VIB must be loaded on the ESXi host before the vDS settings are pushed from the vCenter Server. In other words, the ESXi host must be prepared for VXLAN and be able to accept the configuration.

Run this command to confirm that the vDS is prepared for VXLAN:

esxcli network vswitch dvs vmware vxlan list

On a correctly prepared system, the vDS returns output similar to:

VDS ID                                           VDS Name   MTU   Segment ID  Gateway IP  Gateway MAC        Network Count  Vmknic Count
-----------------------------------------------  ---------  ----  ----------  ----------  -----------------  -------------  ------------
9f 6e 2d 50 94 25 9b d2-0b 16 0e d8 20 d5 26 ea  1-vds-581  1600  172.19.0.0  172.19.0.1  00:26:98:02:ee:41  0              1


Note: On a system with this issue, no vDS details are displayed.

VMware NSX for vSphere uses a live update process whereby the VXLAN and other VIBs packaged in the NSX offline bundle are first uninstalled and then reinstalled during an upgrade, or when an ESXi host is removed from a cluster and added back, such as during a maintenance window. During the installation process, the VIBs must be installed in the running image and also written to the boot disk. If the installation fails and the NSX VIBs are not copied to the boot disk, the ESXi host works as expected until the next reboot. Upon that reboot, NSX host preparation fails because no VIBs are available in the running image of the system.

Note: NSX host preparation refers to both the VXLAN VIB and the DFW VIB; therefore, a live update failure also impacts DFW for both VLAN-based and VXLAN-based networks.
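
To verify that the NSX VIBs are present in the running image of an ESXi host, list the installed VIBs and filter for the NSX components; the name filter below mirrors the one used in the batch example later in this article:

# List the installed VIBs and filter for the NSX components (VXLAN, vsip/DFW, switch security)
esxcli software vib list | grep -iE "vxlan|vsip|switch-security"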

In some scenarios, host preparation fails because of one of the following:
  • Some system event prevents the NSX VIBs from being written to the disk. For example, VMware is aware of a scenario in which third-party VIBs packaged for certain hardware support place a lock on the Intel ixgbe network interface driver and, in turn, prevent the NSX VIBs from being written to the disk for the next reboot. In other words, a partial installation has taken place.

    VMware NSX for vSphere 6.1.5 and 6.2.0 allow the NSX installation process to continue even if a third-party VIB prevents the unloading of the ixgbe driver, which is required to enable RSS. In these releases, RSS may not be enabled; check the /var/log/syslog.log file on the affected ESXi host for the following error messages (a quick check is shown after this list):

    2015-09-02T18:11:13Z init.d/vxlan-vib: ERROR: Failed to load ixgbe with RSS
    File "/etc/init.d/vxlan-vib", line 243, in <module>
    File "/etc/init.d/vxlan-vib", line 199, in updateIxgbeDriver

    Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

  • The vCenter Server and EAM fail to connect successfully to vSphere Update Manager (VUM), and typically only some of the ESXi hosts in a cluster fail the installation process. As emphasized above, the would block condition occurs whenever a prepared NSX host reboots without the NSX VIBs on the boot disk; this is the only cause of the problem.

    For DFW-only NSX use cases, DFW starts to work as expected after the re-install by EAM. However, this remediation does not happen immediately after the reboot, which creates a small window during which DFW policies are not enforced. It is therefore critical to understand the reason for the NSX VIB installation failure and correct it.

    If re-installing the NSX VIBs takes a very long time, NSX host preparation is considered unsuccessful, and the NSX Manager User Interface (UI) changes the host preparation status to red or Not Ready. In this scenario, the DFW module attempts to push the rules to the cluster after the reboot by way of the NSX message bus, and waits two minutes to receive confirmation from the ESXi hosts that the rules have been received and applied.

    Note: When moving hosts into or out of an NSX prepared cluster, un-configure VXLAN before performing the move, even when the host is being moved to another prepared NSX cluster.
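
As referenced in the first scenario above, a quick way to check an affected host for the RSS-related installation failure is to search the system log for the error message quoted earlier:

# Search the ESXi system log for the ixgbe/RSS failure logged by the vxlan-vib init script
grep "Failed to load ixgbe with RSS" /var/log/syslog.log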

Resolution

To address the scenarios under which the would block condition occurs, VMware introduced changes in both VMware vSphere and NSX for vSphere.

VMware vSphere

  • In VMware vSphere 6.0 Update 1, when the host scan operation by EAM fails, the NSX host preparation status incorrectly shows as green because the returned error code 99 is unhandled. In releases with a fix for this issue, EAM correctly reports that the host scan failed and raises a vibNotInstalled state. The following example message from the eam.log file shows error code 99:

    INFO | 2015-02-25 11:41:32,221 | host-4 | VcPatchManager.java | 356 | Scan result on VcHostSystem(host-383):
    ScanResult {
    errorCode = 99,
    vibUrl = 'https://172.16.224.232/bin/vdn/vibs/5.5/vxlan.zip',
    responseXml = <unset>,
    bulletins = [],

VMware NSX for vSphere

A workaround and new NSX Manager log messages are implemented in VMware NSX for vSphere 6.1.5 and VMware NSX for vSphere 6.2.0:

The NSX workaround consists of detection logic in which NSX Manager checks the preparation status of a host after it is rebooted. If the host is found in the would block state, NSX Manager takes corrective action by resetting the vSphere Distributed Switch properties, then the VTEP properties, and finally the port properties on the ESXi host. This corrective action repairs the VXLAN configuration without a second host reboot.

In addition, NSX Manager indicates through the following log messages that the would block condition was detected and resolved:

  • VXLAN opaque data is missing on host host-351, repushing the opaque data.

    Note: This log indicates that the remediation has started.
  • Reseted the vxlan opaque properties on host host-351.

    Note: This log indicates that the remediation has completed.
  • Not detected VXLAN opaque data missing on host-351, skip repush the opaque data.

    Note: This log indicates that the ESXi host has not encountered any issues on the reboot.

Some virtual machines may start running on the ESXi host before NSX Manager finishes the detection and remediation described above. These virtual machines remain disconnected even after the ESXi host is repaired and must be reconnected to the network.

After the remediation process completes, you need to re-sync the message bus using the REST API.

Notes: Before performing the steps, ensure that:
  • You use basic authentication with the NSX Manager web credentials, such as the admin user or any vCenter user granted NSX privileges.
  • Headers Content-type: application/xml and Accept: application/xml are used.
For more information on how to make API calls to the NSX Manager, see the Using the NSX REST API section in the VMware NSX for vSphere API Guide.

To re-sync the message bus, use REST API:

Request:

POST https://NSX_Manager_IP/api/2.0/nwfabric/configure?action=synchronize

Request Body:

<nwFabricFeatureConfig>
<featureId>com.vmware.vshield.vsm.messagingInfra</featureId>
<resourceConfig>
<resourceId>{HOST/CLUSTER MOID}</resourceId>
</resourceConfig>
</nwFabricFeatureConfig>
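
For reference, the same request can be issued with curl; the NSX Manager address, credentials, and MOID below are placeholders that you must replace:

# Re-sync the message bus for a specific host or cluster (replace the placeholders first)
curl -k -u admin:password \
    -X POST \
    -H "Content-Type: application/xml" -H "Accept: application/xml" \
    -d '<nwFabricFeatureConfig><featureId>com.vmware.vshield.vsm.messagingInfra</featureId><resourceConfig><resourceId>{HOST/CLUSTER MOID}</resourceId></resourceConfig></nwFabricFeatureConfig>' \
    "https://NSX_Manager_IP/api/2.0/nwfabric/configure?action=synchronize"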


Note: To further troubleshoot a would block condition, collect the logs immediately after the message bus is re-synced.

If the NSX Agent has already been deployed, attempt the following solutions in sequence to install the NSX VIBs:

Note: For NSX 6.1.4 and earlier, the VIBs can be downloaded from https://vsm-ip/bin/vdn/vibs/5.5/vxlan.zip.
  1. If you see Resolve in the Installation Status column, click Resolve and then refresh your browser window. For more information, see the Prepare Hosts on the Primary NSX Manager section of the Cross-vCenter NSX Installation Guide.
  2. Perform a standard VIB install with Update Manager or with the esxcli command (see the example after this list).

    For more information on how to install a VIB on an ESXi host, see Downloading and installing async drivers in VMware ESXi 5.x and ESXi 6.0.x (2005205). Also, see Installing patches on an ESXi 5.x/6.x host from the command line (2008939).

    Note: After the above commands are run:
    • Reboot each ESXi host where the force install succeeded (that is, the VIBs were not skipped).
    • On the NSX Installation tab, click Resolve on the cluster so that NSX detects that the VIBs are now correctly installed.

    If the NSX Agent has not been deployed:

    • Remove the ESXi host from the cluster
    • Reboot the host
    • Add the host back to the cluster
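
As referenced in step 2, a standard esxcli install of the NSX VIB bundle looks like the following; the depot path is a placeholder for the location where the downloaded vxlan.zip was saved:

# Standard install of the NSX VIBs from the downloaded offline bundle
esxcli software vib install --depot /path/vxlan.zip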

What is fixed in vSphere 6.0 Update 2

  • In vCenter Server 6.0 U2 and ESXi 5.5 P8, for the case of a live NSX VIB installation on new (not yet prepared) hosts, EAM now reports a partial installation (for example, the NSX VIBs have been installed in the running image but not on the boot disk). In vSphere versions without this change, an error condition, such as another VIB (for example, the ixgbe driver VIB), can prevent the NSX VIBs from being copied to the boot disk, and EAM reports a successful NSX installation / Host Ready status.
  • A new esxupdate error code that instructs you to reboot the host immediately is reported when there is a live VIB installation failure caused by jumpstart plugins, rc scripts, or init.d scripts.
  • A new error code with a Please reboot the host immediately to discard the unfinished message is reported when there is a live NSX VIB installation failure. The failed NSX VIB is not reported as installed. The new error code and message are reported to the NSX User Interface (UI).

Detect if alt-bootbank is not up to date and do a force install

To detect whether the alt-bootbank is up to date, run this command with the --dry-run option. If the alt-bootbank is not up to date, the output reports the VIBs as installed instead of skipped:

esxcli software vib install --no-live-install --force --dry-run --depot /path/vxlan.zip

Key:
--force | -f
Bypasses checks for package dependencies, conflicts, obsolescence, and acceptance levels. Really not recommended unless you know what you are doing. Use of this option will result in a warning being displayed in the vSphere Client.

--no-live-install
Forces an install to /altbootbank even if the VIBs are eligible for live installation or removal. This causes the installation to be skipped on PXE-booted hosts.

--dry-run
Performs a dry-run only. Report the VIB-level operations that would be performed, but do not change anything in the system.

--depot | -d
Specifies full remote URLs of the depot index.xml or server file path pointing to an offline bundle .zip file.

For more information, see the esxcli software section of the vSphere Command-Line Interface Documentation.

On any ESXi host where the alt-bootbank is not up to date, run the same command without the --dry-run option.

esxcli software vib install --no-live-install --force --depot /path/vxlan.zip


Additional Information

To automate the above detection procedure on multiple ESXi hosts, you can use a Windows batch file with vCLI to:
  • Check if standard image shows VIBs installed.
  • Check if ESXi host is currently connected to VXLAN.
  • Test Force install VIBs to bring alternate boot bank up to date.

    set VI_USERNAME=root
    set VI_PASSWORD=password
    @echo off
    for %%e in (ESX01 ESX02 ESX03 …) do (
    echo %%e
    REM Check: standard image shows vibs installed
    "c:\Program Files (x86)\VMware\VMware vSphere CLI\bin\esxcli" -s %%e software vib list | findstr -i "vxlan vsip switch-security" | find /C "VMware"
    REM test if ESX currently connected to VXLAN
    "c:\Program Files (x86)\VMware\VMware vSphere CLI\bin\esxcli" -s %%e network vswitch dvs vmware vxlan list
    REM Check force install does not succeed i.e. alternate boot bank is up to date
    "c:\Program Files (x86)\VMware\VMware vSphere CLI\bin\esxcli" -s %%e software vib install --no-live-install --force --dry-run --depot /vmfs/volumes/55d393fd-0de6432f-19b2-8cdcd4b521e8/NSX/vxlan.zip
    )

    For more information on vCLI, see the vSphere Command-Line Interface Documentation.

PowerCLI method (this example uses the Get-EsxCli -V2 interface; replace the depot path with the location of vxlan.zip in your environment):

$vcServer = "vCenter01"
$cluster = "CL01"
$esxCred = Get-Credential
# Path to the NSX offline bundle (vxlan.zip) as seen by the ESXi hosts
$depotPath = "/path/vxlan.zip"
#Connect to vCenter
Connect-VIServer $vcServer | Out-Null
#Connect to ESX hosts in cluster
foreach ($esx in Get-Cluster $cluster | Get-VMHost) {
    Connect-VIServer $esx -Credential $esxCred | Out-Null
}
#Retrieve an esxcli (V2) instance for each host and loop through them
foreach ($esx in Get-Cluster $cluster | Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $esx -V2
    # Check whether the standard image shows the NSX VIBs as installed
    $esxcli.software.vib.list.Invoke() | Where-Object { $_.Name -match "vxlan|vsip|switch-security" }
    # Test whether the ESXi host is currently configured for VXLAN
    $esxcli.network.vswitch.dvs.vmware.vxlan.list.Invoke()
    # Dry-run force install to test whether the alternate boot bank is up to date
    $esxcli.software.vib.install.Invoke(@{depot = $depotPath; dryrun = $true; force = $true; noliveinstall = $true})
}
#Disconnect from ESX hosts
foreach ($esx in Get-Cluster $cluster | Get-VMHost) {
    Disconnect-VIServer $esx.Name -Confirm:$false
}
#Disconnect from vCenter
Disconnect-VIServer $vcServer -Confirm:$false | Out-Null


For more information on PowerCLI, see the vSphere PowerCLI Documentation.


How to download and install async drivers in ESXi 5.x/6.x
“esxcli software vib” commands to patch an ESXi 5.x/6.x host
Deploying VMware NSX for vSphere 6.x through Auto Deploy
Recreating VDR instances fails after rebooting an ESXi host in VMware NSX for vSphere 6.x