RX packet drops seen on network adapters due to page allocation failure "Failed to allocate all, init'ed rx ring"


Article ID: 318584


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction


This article provides information to troubleshoot RX packet drops caused by exhaustion of the packet page pool (NetPagePool) and describes how to work around the issue.

Symptoms:
  1. RX packet drops are seen on NICs using the bnxtnet driver (per-NIC receive drop counters can be checked as sketched after this list).

  2. Entries like the following are seen in the ESXi kernel log (/var/log/vmkernel.log).

2021-02-26T14:03:24.801Z cpu91:2097484)WARNING: bnxtnet: alloc_rx_buffers:2094: [vmnic2 : 0x45033c788000] Failed to allocate all, init'ed rx ring 2 with 2822/4092 pages only
2021-02-26T14:03:24.820Z cpu91:2097484)WARNING: bnxtnet: alloc_rx_buffers:2094: [vmnic1 : 0x45033c7b6000] Failed to allocate all, init'ed rx ring 7 with 22/3069 pages only

  3. The packet page pool memory is exhausted.

Below is the command to check the usage information of the packet page pool (netPktPagePool). If the consumed or consumedPeak value has already reached, or is close to, the max value, the issue can be observed.
[root@esxhost:~] memstats -r group-stats -s gid:name:max:consumed:consumedPeak -u mb | grep netPktPagePool
gid   name               max    consumed   consumedPeak
----  ----------------   -----  --------   ------------
163   netPktPagePool     1260   1260       1260

  4. When the driver is operating in Enhanced Datapath / Enhanced Network Stack (ENS) mode, there is a higher probability of occurrence due to higher NIC RX ring usage. The probability is also higher on ESXi 6.7 versions earlier than 6.7 Patch 06 and ESXi 7.0 versions earlier than 7.0 Update 1, due to the very small default NetPagePool size limit in these versions.

  5. The MTU is set to 4000 or greater, or hardware LRO (HW LRO) is enabled.
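
As a quick check of the first symptom, the per-NIC receive drop counters can be inspected with esxcli. This is a sketch only: vmnic2 is a placeholder for the affected NIC, the counter values are illustrative, and the exact output field names may vary slightly between ESXi releases.

[root@esxhost:~] esxcli network nic stats get -n vmnic2 | grep -i drop
   Receive packets dropped: 1021
   Transmit packets dropped: 0

A non-zero and increasing receive drop counter on a bnxtnet NIC, together with the log entries above, matches this issue.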



Environment

VMware ESXi 6.7.x
VMware vSphere ESXi 7.0.x
VMware ESXi 6.5.x

Cause

The bnxtnet driver uses a special pool of memory, known as NetPagePool, to receive LRO packets or packets larger than 4 KB. When many NIC RX rings are in use to process these types of packets, the required memory may exceed the default size limit of NetPagePool. Depending on when NetPagePool exhaustion happens, it can cause a failure to initialize the NIC device in ESXi, a complete loss of RX traffic on the NIC, or RX packet drops.

Resolution

ESXi 6.7 Patch 06 and ESXi 7.0 Update 1 increased the default NetPagePool size limit, so upgrading ESXi 6.7 / 7.0 to these or later versions resolves the issue in most cases. If the issue persists, the NetPagePool size can be increased further using the workaround described below.
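
To confirm whether the host is already running one of these releases, the ESXi version and build can be checked with esxcli. The values shown below are placeholders, not output from an affected host.

[root@esxhost:~] esxcli system version get
   Product: VMware ESXi
   Version: 7.0.3
   Build: Releasebuild-XXXXXXXX
   Update: 3
   Patch: XX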

Workaround:

There are two workarounds available.

  1. Increase the size of the packet page pool.

The max value of the packet page pool is determined by two parameters: netPagePoolLimitCap and netPagePoolLimitPerGB. The effective limit is the smaller of the two page counts calculated from these parameters.

The max value of the packet page pool (in bytes) = MIN(SystemMemoryNumGB * netPagePoolLimitPerGB, netPagePoolLimitCap) * 4096

First, determine which parameter limits the current max value by checking the current max value of netPktPagePool along with netPagePoolLimitCap and netPagePoolLimitPerGB. Then adjust netPagePoolLimitPerGB or netPagePoolLimitCap accordingly.

The commands to check these values are as follows:
[root@esxhost:~] memstats -r group-stats -s gid:name:max:consumed:consumedPeak -u mb | grep netPktPagePool
gid   name               max    consumed   consumedPeak
----  ----------------   -----  --------   ------------
163   netPktPagePool     1260   1260       1260

[root@esxhost:~] esxcli system settings kernel list | grep netPagePoolLimitCap
Name                   Type    Configured  Runtime   Default   Description
---------------------  ------  ----------  --------  --------  -----------
netPagePoolLimitCap    uint32  1048576     1048576   1048576   Maximum number of pages period for the packet page pool.

[root@esxhost:~] esxcli system settings kernel list | grep netPagePoolLimitPerGB
Name                   Type    Configured  Runtime   Default   Description
---------------------  ------  ----------  --------  --------  -----------
netPagePoolLimitPerGB  uint32  5120        5120      5120      Maximum number of pages for the packet page pool per gigabyte.
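
As a worked illustration using the example output above (the host memory size of roughly 63 GB is inferred from these numbers purely for the sake of the example):

Current max of netPktPagePool              = 1260 MB / 4 KB per page = 322560 pages
SystemMemoryNumGB * netPagePoolLimitPerGB  = 63 * 5120               = 322560 pages
netPagePoolLimitCap                        = 1048576 pages           (= 4096 MB)
MIN(322560, 1048576) * 4096 bytes          = 1260 MB

Here the per-GB limit, not the cap, determines the max value, so increasing netPagePoolLimitPerGB is the relevant change in this case.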


The commands to adjust netPagePoolLimitPerGB and netPagePoolLimitCap are as follows:
esxcli system settings kernel set -s netPagePoolLimitPerGB -v <value>
esxcli system settings kernel set -s netPagePoolLimitCap -v <value>


Note: After adjusting netPagePoolLimitPerGB and netPagePoolLimitCap, the ESXi host must be rebooted for the changes to take effect.

In general, a max value of 4 GB for netPktPagePool is recommended if system memory is sufficient. The corresponding recommended value for netPagePoolLimitCap is 1048576 (1048576 pages * 4096 bytes = 4 GB).
(Note: netPagePoolLimitCap is already 1048576 by default since ESXi 7.0 Update 1.)

For netPagePoolLimitPerGB, since the resulting limit scales with the total amount of system memory, the appropriate value depends on the specific host; see the example below.
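
As an illustration only (the 64 GB host size and the resulting value are assumptions for this example, not fixed recommendations): to allow the pool to reach the recommended 4 GB (1048576 pages) on a host with 64 GB of memory, netPagePoolLimitPerGB must be at least 1048576 / 64 = 16384.

[root@esxhost:~] esxcli system settings kernel set -s netPagePoolLimitPerGB -v 16384
[root@esxhost:~] esxcli system settings kernel list | grep netPagePoolLimitPerGB
netPagePoolLimitPerGB  uint32  16384       5120      5120      Maximum number of pages for the packet page pool per gigabyte.

After the set command, the Configured column shows the new value immediately, while the Runtime column only changes once the host has been rebooted.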

  2. Reduce the MTU to 4000 or smaller and disable hardware LRO (HW LRO); see the check commands sketched below.
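
For the second workaround, the current per-NIC MTU and the bnxtnet module parameters can be reviewed with esxcli before making changes. This is a sketch only: on standard vSwitches the MTU can be changed with the command shown (vSwitch0 is a placeholder), distributed vSwitches are configured from vCenter, and the module parameter that controls HW LRO varies between bnxtnet driver versions, so confirm it from the parameter list and the driver documentation rather than assuming a specific name.

[root@esxhost:~] esxcli network nic list
[root@esxhost:~] esxcli system module parameters list -m bnxtnet
[root@esxhost:~] esxcli network vswitch standard set -v vSwitch0 -m 1500

Changes to bnxtnet module parameters are applied with 'esxcli system module parameters set -m bnxtnet -p "<parameters>"' and also require a reboot of the ESXi host.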


Additional Information

Nic down when adding to vSwitch
https://kb.vmware.com/s/article/76656


Impact/Risks:
  • Additional memory is consumed by the larger packet page pool.

  • A reboot of the ESXi host is required.