NSX-v prepared ESXi host may observe a PSOD with "Virtual Infrastructure Latency" in VRNI 4.2 and higher

Article ID: 318909

Updated On:

Products

VMware Aria Operations for Networks
VMware NSX Networking

Issue/Introduction

Symptoms:
  • ESXi hosts prepared with NSX-V 6.4.5 or NSX-V 6.4.6 VIBs may fail with a purple diagnostic screen (PSOD) while Virtual Infrastructure Latency is enabled.
  • In the PSOD stack trace, you see entries similar to:

    #0 DLM_free (msp=0x431a455dcca0, mem=mem@entry=0x431a458cbd10, allowTrim=allowTrim@entry=1 '\001') at bora/vmkernel/main/dlmalloc.c:4924
    #1 0x0000418012343ffa in Heap_Free (heap=0x431a455dc000, mem=<optimized out>, mem@entry=0x431a458cbd10) at bora/vmkernel/main/heap.c:4314
    #2 0x000041801222db25 in vmk_HeapFree (heap=<optimized out>, mem=mem@entry=0x431a458cbd10) at bora/vmkernel/core/vmkapi_heap.c:250
    #3 0x000041801393ca61 in __VDL2_Free (heapID=<optimized out>, data=data@entry=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2.c:152
    #4 0x0000418013950caf in VDL2_CPTaskFree (task=task@entry=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_ctlplane.c:164
    #5 0x0000418013949415 in VDL2CPWorldProcessTask (task=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_cpworld.c:283
    #6 VDL2CPWorldFunc (data=data@entry=0x0) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_cpworld.c:335
    #7 0x0000418012308adf in vmkWorldFunc (data=<optimized out>) at bora/vmkernel/main/vmkapi_world.c:528
    #8 0x00004180124c91f5 in CpuSched_StartWorld (destWorld=<optimized out>, previous=<optimized out>) at bora/vmkernel/sched/cpusched.c:10792
    #9 0x0000000000000000 in ?? ()

     
  • In the /var/log/vmkernel.log file of the ESXi host, you see entries similar to the following, indicating that BFD was enabled on the host:

    # cpu75:68603 opID=6616a61a)vxlan: VDL2PortsetPropSet:1036: Updating BFD VTEP config to : enable
    # cpu75:68603 opID=6616a61a)BFD: BFD_CreateNewSession ENTER: localIP: a.b.c.d , remoteIP: w.x.y.z , probeInterval (in milli seconds): 12000
    # cpu75:68603 opID=6616a61a)WARNING: BFD: Inserted new session: Discriminator 1471713223, localIP: a.b.c.d remoteIP: w.x.y.z

     
  • In the ESXi core dump, you see BFD messages similar to the following (BFD state change: init→up):

    less vmkernel-zdump.1
         
            vers:1 diag:"No Diagnostic" state:up mult:3 length:24
            flags: pol
            my_disc:0x50c322ca your_disc:0x39f2436f
            min_tx:300000us (300ms)
            min_rx:12000000us (12000ms)
            min_rx_echo:0us (0ms)(null): BFD state change: init->up "No Diagnostic"->"No Diagnostic".(null): New remote min_rx.
            vers:1 diag:"No Diagnostic" state:up mult:3 length:24
            flags: pol
            my_disc:0x5a566ae8 your_disc:0x16f3890c
            min_tx:300000us (300ms)
            min_rx:12000000us (12000ms)
            min_rx_echo:0us (0ms)(null): BFD state change: init->up "No Diagnostic"->"No Diagnostic".(null): New remote min_rx.


Environment

VMware vRealize Network Insight 5.x
VMware NSX Data Center for vSphere 6.4.x
VMware vRealize Network Insight 4.x

Cause

VRNI's Virtual Infrastructure Latency feature uses the BFD service on NSX-enabled hosts to establish tunnels between hosts. The PSOD occurs when the NSX kernel module responds to a detailed BFD tunnel query from the control plane agent with the state of every BFD session that the kernel maintains.

Note: The PSOD is not observed when the number of BFD tunnels is in the low hundreds. It occurs only when the number of tunnels exceeds 900.

To determine the number of BFD tunnels that each host sees, use the following formula:

(# of hosts - 1) x (# of VTEPs per host) ^ 2

For example, in a cluster of 4 hosts with 2 VTEPs each, the number of tunnels each host sees is:

(4 - 1) x 2 ^ 2 = 3 x 4 = 12 tunnels
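
As a quick check against the 900-tunnel threshold noted above, the formula can be evaluated in a shell. This is only a minimal sketch: the function names tunnels_per_host and exceeds_threshold are illustrative and are not part of NSX or VRNI.

```shell
#!/usr/bin/env bash
# Sketch of the per-host tunnel formula: (hosts - 1) x (VTEPs per host)^2.
# Function and argument names are hypothetical, chosen for illustration.
tunnels_per_host() {
  local hosts=$1 vteps=$2
  echo $(( (hosts - 1) * vteps * vteps ))
}

# Succeeds when the tunnel count is above the 900 mark mentioned in the note.
exceeds_threshold() {
  local tunnels=$1
  [ "$tunnels" -gt 900 ]
}

tunnels_per_host 4 2    # the article's example: prints 12
```

For instance, `exceeds_threshold "$(tunnels_per_host 64 4)"` succeeds, because 63 x 16 = 1008 tunnels.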

Resolution

This issue is resolved in VMware NSX for vSphere 6.4.8, available at VMware Downloads.


Workaround #1

Use this option if the Virtual Infrastructure Latency feature is enabled through VRNI.
  1. Navigate to Settings > Accounts and Datasource.
  2. Edit the NSX Manager datasource and clear the “Enable Virtual Infrastructure Latency” check box.
  3. Click Submit to confirm the change.


Workaround #2

Use this option if the VRNI appliance is not accessible or if the Virtual Infrastructure Latency feature is enabled through the NSX API.

Use the GET API to determine the BFD status:

GET /api/2.0/vdn/bfd/configuration/global

Response:

<bfdGlobalConfiguration>
      <enabled>true</enabled>
      <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
      <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
</bfdGlobalConfiguration>
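
The status check above could be scripted roughly as follows. This is a sketch under assumptions: the helper names bfd_enabled and get_bfd_config are hypothetical, and NSX_MANAGER, NSX_USER, and NSX_PASS are placeholder variables you would set yourself. The curl flags mirror those used in the sample requests later in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a bfdGlobalConfiguration document on stdin
# and prints the text of its <enabled> element (true or false).
bfd_enabled() {
  sed -n 's:.*<enabled>\(.*\)</enabled>.*:\1:p'
}

# Hypothetical wrapper around the GET call shown above.
# NSX_MANAGER, NSX_USER and NSX_PASS are placeholders, not NSX-defined names.
get_bfd_config() {
  curl -k -s -u "${NSX_USER}:${NSX_PASS}" \
    -H "Content-Type: application/xml" -X GET \
    "https://${NSX_MANAGER}/api/2.0/vdn/bfd/configuration/global"
}
```

Running `get_bfd_config | bfd_enabled` would then print the value of the enabled flag returned by NSX Manager.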

Use the PUT API to disable BFD:

PUT /api/2.0/vdn/bfd/configuration/global

Request Body:

<bfdGlobalConfiguration>
      <enabled>false</enabled>
      <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
      <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
</bfdGlobalConfiguration>
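
One possible way to send this request from a shell is sketched below. The helper names bfd_disable_body and disable_bfd are hypothetical, as are the NSX_MANAGER, NSX_USER, and NSX_PASS placeholder variables. The interval defaults simply repeat the sample values above, so pass the values that the GET call returns in your environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the PUT request body with <enabled> set to false.
# $1 = polling interval in seconds, $2 = BFD interval in milliseconds.
# Defaults repeat the sample values from this article (an assumption).
bfd_disable_body() {
  local polling=${1:-180} interval=${2:-120000}
  cat <<EOF
<bfdGlobalConfiguration>
  <enabled>false</enabled>
  <pollingIntervalSecondsForHost>${polling}</pollingIntervalSecondsForHost>
  <bfdIntervalMillSecondsForHost>${interval}</bfdIntervalMillSecondsForHost>
</bfdGlobalConfiguration>
EOF
}

# Hypothetical wrapper: pipes the body into curl, which reads it from stdin (@-).
# NSX_MANAGER, NSX_USER and NSX_PASS are placeholders, not NSX-defined names.
disable_bfd() {
  bfd_disable_body "$@" | curl -k -s -u "${NSX_USER}:${NSX_PASS}" \
    -H "Content-Type: application/xml" -X PUT --data-binary @- \
    "https://${NSX_MANAGER}/api/2.0/vdn/bfd/configuration/global"
}
```

Printing the body with `bfd_disable_body` before calling `disable_bfd` lets you review the XML that would be sent.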

 

Regardless of which workaround you use, block the API that enables BFD in NSX Manager until a permanent fix is applied.

 

Steps to block the BFD API in NSX Manager

Take an FTP backup of NSX Manager before performing the steps below.

  • Log in to NSX Manager over SSH (for example, with PuTTY) and enter root mode.
  • The file to edit, mapping.conf, is located in /usr/appmgmt-webserver/webapps/ROOT/WEB-INF/classes/
  • Create a temporary directory under WEB-INF and back up the mapping.conf file into it:
cd /usr/appmgmt-webserver/webapps/ROOT/WEB-INF/

mkdir temp

cp classes/mapping.conf /usr/appmgmt-webserver/webapps/ROOT/WEB-INF/temp/mapping.conf.orig

 

  • Edit the mapping.conf file at /usr/appmgmt-webserver/webapps/ROOT/WEB-INF/classes/mapping.conf
  • Add an entry with the prefix ‘I’ to block an API, then save the file. For example, an entry for /api/2.0/vdn/bfd/configuration/global blocks the BFD API.
  • Restart the appliance management service for the configuration to take effect:

/etc/init.d/app-mgmt restart

  • While the service restarts, the NSX UI is unavailable for a few seconds.
  • To verify the change, call the BFD REST API (GET or PUT). The request returns a 404 or 403 response, which is also recorded in the appliance management logs.

Sample Output

Request: 

root@vmware:~# curl -H "Content-Type: text/xml" -k -u username:password -X PUT https://<NSX manager IP>/api/2.0/vdn/bfd/configuration/global

Response:

<!doctype html><html lang="en"><head><title>HTTP Status 403 – Forbidden</title><style type="text/css">h1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} h2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} h3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} body {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} b {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} p {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;} a {color:black;} a.name {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 403 – Forbidden</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Description</b> The server understood the request but refuses to authorize it.</p><hr class="line" /></body></html>root@vmware:~#

 

Request: 

root@vmware:~# curl -H "Content-Type:application/xml" -k -u username:password -X GET https://<NSX manager IP>/api/2.0/vdn/bfd/configuration/global

Response: 

<!doctype html><html lang="en"><head><title>HTTP Status 404 – Not Found</title><style type="text/css">h1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} h2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} h3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} body {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} b {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} p {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;} a {color:black;} a.name {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 404 – Not Found</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Message</b> &#47;api&#47;2.0&#47;vdn&#47;bfd&#47;configuration&#47;global</p><p><b>Description</b> The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.</p><hr class="line" /></body></html>root@vmware:~#
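
Because the HTML error pages above are verbose, a verification script might look only at the HTTP status code. The following is a sketch under assumptions: bfd_api_status and is_blocked are hypothetical helper names, and NSX_MANAGER, NSX_USER, and NSX_PASS are placeholder variables. It relies on curl's -o and -w '%{http_code}' options to discard the body and print the status code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints only the HTTP status code of a GET on the BFD endpoint.
# NSX_MANAGER, NSX_USER and NSX_PASS are placeholders, not NSX-defined names.
bfd_api_status() {
  curl -k -s -o /dev/null -w '%{http_code}' -u "${NSX_USER}:${NSX_PASS}" \
    -H "Content-Type: application/xml" -X GET \
    "https://${NSX_MANAGER}/api/2.0/vdn/bfd/configuration/global"
}

# Succeeds when the status code is one this article lists for a blocked API.
is_blocked() {
  case "$1" in
    403|404) return 0 ;;
    *)       return 1 ;;
  esac
}
```

A check such as `is_blocked "$(bfd_api_status)" && echo "BFD API is blocked"` would then confirm the result without parsing the HTML.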