Intermittent network issues during vSAN/vMotion traffic with qedentv driver

Article ID: 317657

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
In rare cases under heavy vSAN or vMotion Ethernet traffic, the qedentv driver can cause intermittent network issues that lead to vMotion failures or errors in vSAN transactions.

For vSAN/vMotion traffic, under certain conditions, the vmkernel instantiates a netqueue or a netqueue with RSS, as appropriate. Under high load, an intermittent timing issue in the qedentv driver can prevent receive traffic from flowing correctly through the netqueue. When this happens, vSAN may report heartbeat failures, and vMotion may fail because of a timeout connecting to the destination host or an incomplete transfer of VM memory pages.

When the issue occurs, the MAC filter associated with the vmknic used for vSAN/vMotion traffic moves rapidly back and forth between the default queue and the netqueue. MAC filter movement between the default queue and a netqueue is normal in itself. Rapid movement accompanied by the traffic failures described above, however, is a likely manifestation of this problem.

In the vmkernel logs, messages similar to the following are observed repeatedly. vSAN/vMotion failure messages may be interspersed with the driver messages.

vmnic2)]Removing mac:00:50:56:62:c8:9a, vlan_id:0x0, from fp:0, op:MAC_DEL, hw_fn:0
vmnic2)]Applying 00:50:56:62:c8:9a filter, vlan_id:0xffff, fp_id:1, hw_fn:0.
vmnic3)]Feature RSS needed.
<snip>
WARNING: VMotionUtil: 4060: 1195397824256221929 S: Stream completion work failed: Timeout
WARNING: Migrate: 273: 1195397824256221929 S: Failed: Timeout (0xbad0021) @0x4180146cb675
WARNING: VMotionUtil: 850: 1195397824256221929 S: failed to read stream keepalive: Connection closed by remote host, possibly due to timeout
<snip>
vmnic2)]Removing mac:00:50:56:62:c8:9a, vlan_id:0x0, from fp:1, op:MAC_DEL, hw_fn:0
vmnic2)]Applying 00:50:56:62:c8:9a filter, vlan_id:0xffff, fp_id:0, hw_fn:0.
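
The repeated filter movement in the excerpt above can be counted with a small shell helper. This is a minimal sketch, not an official diagnostic. The function name count_filter_moves is hypothetical, and it assumes the driver messages follow the "Applying <MAC> filter" wording shown above. This article does not define how many moves in a given period count as rapid, so the counts need to be judged together with the vSAN/vMotion failure messages.

```shell
# Hypothetical helper: count "Applying <MAC> filter" events per MAC address.
# Reads vmkernel log text on stdin and prints one "<MAC> <count>" line per MAC.
# When several MACs are present, the output order is not guaranteed.
count_filter_moves() {
    awk '/Applying [0-9a-fA-F:]+ filter/ {
        for (i = 1; i <= NF; i++) {
            # The MAC address is the field right after the word "Applying".
            if ($i ~ /Applying$/) { count[$(i + 1)]++; break }
        }
    }
    END { for (mac in count) print mac, count[mac] }'
}
```

On a host, the function could be fed from the live log, for example `count_filter_moves < /var/run/log/vmkernel.log`. The log path is an assumption and may differ between releases.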


Cause

The problem is caused by a corner-case timing condition in the qedentv driver during the netqueue delete operation. This condition can lead to a mismatch of indices within the interrupt generation logic on the adapter, which impacts receive traffic.

Resolution

This issue is fixed in qedentv driver version 3.11.7.0. Update the driver to version 3.11.7.0 or later, available from VMware Download.
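
To check whether a host already runs a fixed driver, the installed version can be compared against 3.11.7.0. The helper below is a sketch under stated assumptions. The function name version_ge is hypothetical, and it compares only dot-separated numeric components, so any build suffix in the reported version string should be trimmed before passing it in. The installed version can typically be read with `esxcli software vib list | grep qedentv` or `esxcli network nic get -n vmnic2`, though the exact output format may vary by release.

```shell
# Hypothetical helper: succeed (exit 0) if dotted version $1 >= dotted version $2.
# Missing trailing components are treated as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0   # first differing component is larger
            if (xi < yi) exit 1   # first differing component is smaller
        }
        exit 0                    # all components equal
    }'
}

# Example with an assumed installed version string:
if version_ge "3.11.16.0" "3.11.7.0"; then
    echo "driver contains the fix"
else
    echo "driver update needed"
fi
```

A numeric comparison is used rather than a string comparison because, for example, 3.11.16.0 sorts before 3.11.7.0 as text but is the newer version.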

Workaround:
As a workaround, disable netqueues on the qedentv interfaces using the driver module parameters, as shown below. The example assumes there are four qedentv instances.

[root@host:~] esxcfg-module -g qedentv
qedentv enabled = 1 options = ''
[root@host:~] esxcfg-module -s "num_queues=0,0,0,0 RSS=0,0,0,0" qedentv
[root@host:~] esxcfg-module -g qedentv
qedentv enabled = 1 options = 'num_queues=0,0,0,0 RSS=0,0,0,0'


Reboot the system for the settings to take effect. The settings apply to all NICs managed by the qedentv driver.
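
The workaround example uses four comma-separated values because it assumes four qedentv instances. For a different number of instances, the option string can be generated instead of typed by hand. The following is a minimal sketch. The function name qedentv_disable_opts is hypothetical, and the instance count must be supplied by the administrator (for example by counting the qedentv entries in `esxcli network nic list`, which is an assumption about how the host is inspected).

```shell
# Hypothetical helper: build the module option string that sets num_queues and
# RSS to 0 for each of N qedentv instances (N is the first argument).
qedentv_disable_opts() {
    n="$1"
    zeros=""
    i=0
    while [ "$i" -lt "$n" ]; do
        # Append "0", adding a comma separator after the first value.
        zeros="${zeros:+$zeros,}0"
        i=$((i + 1))
    done
    printf 'num_queues=%s RSS=%s\n' "$zeros" "$zeros"
}

# Prints the same option string as the four-instance workaround example.
qedentv_disable_opts 4
```

The result can then be passed to the same command used in the workaround, for example `esxcfg-module -s "$(qedentv_disable_opts 4)" qedentv`, followed by `esxcfg-module -g qedentv` to confirm the stored options.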

Note that disabling netqueue has some performance impact. The magnitude depends on the individual workload and should be characterized before deploying the workaround in production. In most cases, however, the impact is not noticeable.