Network disruption observed after upgrade to NSX-T 3.1.0/3.1.1

Article ID: 336822


Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  •     The environment runs NSX-T Data Center 3.1.0 or 3.1.1 and was upgraded from an earlier version
  •     Or the environment runs NSX-T Data Center 3.1.2 and was upgraded from 3.1.0 or 3.1.1
  •     The following symptoms may be observed:
    •  Network traffic that should be allowed by the Tier-0 or Tier-1 Gateway default Allow rule is not working as expected
    •  NAT rules are not working as expected
  •     Impacted Tier-0 or Tier-1 Gateways have a "-" symbol in their name
  •     Tier-0 or Tier-1 Gateways newly deployed after the upgrade are not impacted
  •     Confirm the status of the default Allow policy:
    •  Switch to Manager UI view. If this option is not available, enable it under System -> User Interface Settings
    •  Navigate to Security -> Edge Firewall
    •  From the dropdown select the Gateway
    •  Confirm there are 2 default policy sections at the bottom of the firewall, one of which is named "Policy_Default_Infra"
    •  Check whether the uppermost default policy section is Stateful or Stateless
  •     For an Active/Active Tier-0, the uppermost default section must be Stateless
  •     For an Active/Standby Tier-0 or Tier-1, the uppermost default section must be Stateful
  •     Note: in some cases a duplicate default rule may also be seen in the Policy UI
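
The checks above can be expressed as a small Python sketch. It is illustrative only: it assumes the default policy sections shown at the bottom of the gateway firewall have already been collected by hand, top to bottom, into a list of dictionaries. The keys "display_name" and "stateful", the tier and HA-mode strings, and both function names are assumptions made for this example, not a documented schema. A positive result is a hint that the gateway matches the pattern described here, not proof.

```python
# Sketch only: dictionary keys, tier/HA-mode strings, and function names
# are assumptions for illustration.

def expected_stateful(gateway_tier, ha_mode):
    """Return True when the uppermost default section should be Stateful."""
    # Only an Active/Active Tier-0 requires a Stateless default section;
    # Active/Standby Tier-0 and Tier-1 Gateways require a Stateful one.
    is_active_active_t0 = gateway_tier == "TIER0" and ha_mode == "ACTIVE_ACTIVE"
    return not is_active_active_t0


def check_default_sections(default_sections, gateway_tier, ha_mode):
    """Check the default sections of one gateway, listed top to bottom."""
    if not default_sections:
        raise ValueError("no default sections supplied")
    names = [section["display_name"] for section in default_sections]
    uppermost = default_sections[0]
    return {
        # The pattern in this article: 2 default sections, one of them
        # named "Policy_Default_Infra".
        "duplicate_default_sections": (
            len(default_sections) == 2 and "Policy_Default_Infra" in names
        ),
        "uppermost_default": uppermost["display_name"],
        "uppermost_state_ok": (
            uppermost["stateful"] == expected_stateful(gateway_tier, ha_mode)
        ),
    }


# Hypothetical data for an Active/Standby Tier-1 Gateway.
sections = [
    {"display_name": "Default", "stateful": False},
    {"display_name": "Policy_Default_Infra", "stateful": True},
]
print(check_default_sections(sections, "TIER1", "ACTIVE_STANDBY"))
```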


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

This issue occurs during an upgrade to NSX-T 3.1.0/3.1.1 when the Gateway Firewall of a Tier-0 or Tier-1 Gateway with a "-" symbol in its name is processed incorrectly. This results in 2 default firewall sections.
This configuration can cause network disruption because NAT or firewall rules are handled incorrectly.

Resolution

This issue is resolved in VMware NSX-T Data Center 3.1.2, available at VMware Downloads.

Workaround:
Note: NSX-T 3.1.2 prevents this issue from occurring. However, if an environment on NSX-T 3.1.0 or 3.1.1 is already impacted, upgrading to 3.1.2 will not resolve the issue.

This issue is seen only for traffic flows that are allowed by the default rule.

As a temporary workaround, create a new catch-all allow rule just above the default policy section.
Create this allow rule in the Policy UI. Make its section Stateful for Active/Standby Gateways and Stateless for Active/Active Gateways.

To permanently resolve the issue, the duplicate default rule must be removed.
This involves 2 steps: running an API delete call and then running the attached script.
  • From the Manager UI, copy the UUID of the "Policy_Default_Infra" section
  • Expand the "Policy_Default_Infra" section and make a note of the firewall rule ID
  • DELETE https://NSX_MGR/api/v1/firewall/sections/<section-UUID>/rules/<rule-id>
    • Header X-Allow-Overwrite: true
    • The problem firewall rule is a protected Policy object, so the API header X-Allow-Overwrite: true must be used
  • The delete operation removes the problem rule
  • After the delete operation completes, the temporary Allow rule added above can be removed
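
The delete call above can be sketched in Python using only the standard library. This is a minimal illustration, not the attached script: the function name, the placeholder values, and the use of HTTP Basic authentication are assumptions made for this example. The function only builds the request; sending it is left as a commented-out line so that nothing is deleted by accident.

```python
import base64
import urllib.request


def build_delete_request(manager, section_id, rule_id, username, password):
    """Build (but do not send) the DELETE request for the problem rule."""
    url = (
        f"https://{manager}/api/v1/firewall/sections/"
        f"{section_id}/rules/{rule_id}"
    )
    # Assumption: the Manager accepts HTTP Basic authentication.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    headers = {
        "Authorization": "Basic " + token.decode("ascii"),
        # Required because the rule is a protected Policy object.
        "X-Allow-Overwrite": "true",
    }
    return urllib.request.Request(url, headers=headers, method="DELETE")


# Placeholder values; substitute the section UUID and rule ID noted
# from the Manager UI.
request = build_delete_request(
    "NSX_MGR", "<section-UUID>", "<rule-id>", "admin", "<password>"
)
print(request.get_method(), request.full_url)

# To send the request (this deletes the rule):
# urllib.request.urlopen(request)
```

If the Manager presents a certificate that the client machine does not trust, the commented-out urlopen call fails with a certificate error until that trust is established.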

The script attached to this KB is a Python script. It can be run from any machine that has python3 installed and network connectivity to the NSX Manager.
Alternatively, if root access to the NSX Manager is allowed, the script can be run directly there.

On a third-party machine, install the prerequisite Python library 'requests' if necessary:
pip3 install requests

Script usage:
On a Linux machine:
python3 default_policy_section_cleanup.py -m <NSX Manager's IP> -u 'admin' -p 'Admin!23Admin'
(Note: single quotes must be used around the password so that the shell does not interpret special characters such as "!".)

On a Mac machine:
python3 default_policy_section_cleanup.py -m <NSX Manager's IP> -u "admin" -p "Admin\!23Admin"
(Note: \ must be used to escape special characters.)
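
As a way to sidestep the per-shell quoting differences above, the command can be assembled in Python. This is a sketch under stated assumptions: the function name and the IP address 192.0.2.10 are placeholders for this example, and the password is the sample value shown above. The -m, -u and -p options are the ones given in the usage lines.

```python
import shlex


def cleanup_command(manager_ip, username, password):
    """Return the script invocation as an argument list (no shell quoting)."""
    return [
        "python3", "default_policy_section_cleanup.py",
        "-m", manager_ip,
        "-u", username,
        "-p", password,
    ]


argv = cleanup_command("192.0.2.10", "admin", "Admin!23Admin")

# shlex.join shows a POSIX-shell-safe form of the same command.
print(shlex.join(argv))
# python3 default_policy_section_cleanup.py -m 192.0.2.10 -u admin -p 'Admin!23Admin'
```

Passing the list to subprocess.run(argv) starts the script without a shell, so the password reaches it unchanged and no escaping is needed.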

Attachments

default_policy_section_cleanup.py