vCenter reports NSX Edge CPU usage close to 100% but Edge vCPUs are not fully saturated

Article ID: 336538

Products

VMware NSX Networking

Issue/Introduction

This article explains how vCenter reports NSX Edge CPU statistics compared with the Edge's own reporting, and how the Edge uses its individual vCPUs.


Symptoms:

vCenter reports NSX Edge CPU utilization of 100% or higher, but the Edge vCPUs are not actually fully utilized.

To confirm this issue, verify that the Edge vCPUs show usage below the alerting threshold.

Step 1
Open an SSH session to the ESXi host where the active Edge resides and run the following:

esxtop

While esxtop is running, you can narrow what content to display. Use "V" to show only virtual machine worlds. You can select specific rows in esxtop by using "2" to scroll down and "8" to scroll up. You can remove highlighted rows from the view by pressing "4". This may be necessary to cull extraneous rows and view relevant information.

In the NAME column, locate the Edge in question. The %Used column for the Edge may show a very high percentage, up to 500%. This is the number that vCenter reports as the CPU usage of the Edge.

In that same row, make note of the GID (Group ID) of the Edge. In esxtop, you can expand the statistics for that specific group to show details of all worlds associated with that GID. To do this, press "e" and then enter the GID number.


Here, the %Used column shows the CPU percentage for each individual vCPU on the Edge. The top row of that GID shows the aggregate of the VM's vCPUs. %Sys is the CPU time consumed by ESX kernel threads on behalf of this VM.

vCenter's total utilization for this Edge is the total vCPU sum plus the kernel threads (%Used + %Sys).

The Edge's report of its own CPU usage is the total vCPU sum.
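
The difference between the two figures can be illustrated with a short worked example. This is a minimal sketch only: the per-vCPU %Used readings and the %Sys value are hypothetical numbers chosen for illustration, not output from a real host, and the function names are not part of any VMware tool. Percentages follow the esxtop convention in which 100% corresponds to one fully used core.

```python
# Hypothetical esxtop readings for a 4-vCPU Edge (illustrative values only).
vcpu_used_pct = [62.0, 58.5, 71.0, 66.5]  # %Used of each vCPU world
kernel_sys_pct = 140.0                    # %Sys: ESX kernel threads working for this VM


def edge_reported_pct(vcpu_used):
    """CPU usage as the Edge sees it: the sum of its vCPUs only."""
    return sum(vcpu_used)


def vcenter_reported_pct(vcpu_used, kernel_sys):
    """CPU usage as described for vCenter: vCPU sum plus kernel thread time."""
    return sum(vcpu_used) + kernel_sys


edge_total = edge_reported_pct(vcpu_used_pct)
vcenter_total = vcenter_reported_pct(vcpu_used_pct, kernel_sys_pct)

print(f"Edge view:    {edge_total:.1f}%")
print(f"vCenter view: {vcenter_total:.1f}%")
print(f"Difference (kernel threads): {vcenter_total - edge_total:.1f}%")
```

With these sample values the two views differ by exactly the %Sys amount, which is why vCenter can show the Edge as saturated while no individual vCPU is.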

In this view, you'll see the individual vCPUs for the Edge. The number of vCPUs on an NSX Edge depends on the form factor.

The vCPU count for each form factor is:

Compact - 1 vCPU
Large - 2 vCPU
Quad Large - 4 vCPU
X-Large - 6 vCPU
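
To put an aggregate %Used reading in context, it can be divided by the vCPU count of the form factor. The sketch below assumes the form factor table above; the helper names and the sample aggregate value are hypothetical and used only for illustration.

```python
# vCPU count per Edge form factor, taken from the list above.
FORM_FACTOR_VCPUS = {
    "Compact": 1,
    "Large": 2,
    "Quad Large": 4,
    "X-Large": 6,
}


def max_aggregate_pct(form_factor):
    """Highest possible vCPU sum, with 100% meaning one fully used vCPU."""
    return FORM_FACTOR_VCPUS[form_factor] * 100


def average_vcpu_usage_pct(form_factor, aggregate_used_pct):
    """Average usage per vCPU for a given aggregate %Used reading."""
    return aggregate_used_pct / FORM_FACTOR_VCPUS[form_factor]


# Hypothetical aggregate reading for a Quad Large Edge.
aggregate = 258.0
average = average_vcpu_usage_pct("Quad Large", aggregate)
print(f"Quad Large: {aggregate:.1f}% of {max_aggregate_pct('Quad Large')}% possible")
print(f"Average per vCPU: {average:.1f}%")
```

An average hides imbalance between vCPUs, so the per-vCPU rows in the expanded esxtop view remain the more reliable check.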

Step 2

Open an SSH session to the Edge VM, log in with admin credentials, and run the following command:
show process monitor

In the output, make sure that the "USAGE" of each core is less than 80%.

If it is above 80%, confirm whether there are any packet drops by running the following command on the ESXi host where the Edge VM resides:

esxtop - Press "n" to view the Edge VM's network usage, and check for problems by evaluating PKTTX/s, PKTRX/s, %DRPTX, and %DRPRX.
 
The packets-per-second (PPS) rate is the reason for high CPU utilization on networking components. Reduce the traffic flow through the Edge VM and check whether CPU utilization goes down.
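
The checks in Step 2 can be summarized as a small decision sketch. It assumes the per-core usage and the drop percentages have been read manually from the command output and typed in; the values, function names, and the treatment of exactly 80% as "not above" are assumptions made for illustration, and nothing here parses real CLI output.

```python
USAGE_THRESHOLD_PCT = 80.0  # per-core threshold named in this article


def cores_over_threshold(core_usage_pct, threshold=USAGE_THRESHOLD_PCT):
    """Return the indexes of cores whose usage is above the threshold."""
    return [i for i, usage in enumerate(core_usage_pct) if usage > threshold]


def has_packet_drops(drp_tx_pct, drp_rx_pct):
    """True if esxtop shows any dropped packets in either direction."""
    return drp_tx_pct > 0.0 or drp_rx_pct > 0.0


# Hypothetical readings: per-core USAGE from the Edge, %DRPTX/%DRPRX from esxtop.
core_usage_pct = [45.0, 91.5, 80.0, 83.0]
busy_cores = cores_over_threshold(core_usage_pct)
drops_seen = has_packet_drops(drp_tx_pct=0.0, drp_rx_pct=1.2)

if not busy_cores:
    print("All cores are at or below the threshold.")
elif drops_seen:
    print(f"Cores {busy_cores} are above the threshold and packets are being dropped.")
else:
    print(f"Cores {busy_cores} are above the threshold, but no drops are reported.")
```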


Environment

VMware NSX for vSphere 6.4.x

Cause

For any VM, the CPU usage that vCenter reports in the GUI is taken from the esxtop output, including system time.

This value is an aggregate of the total CPU usage of the VM's vCPUs plus the system time consumed by ESX kernel threads on that VM's behalf (%Used + %Sys).

Note that the system time consumed by kernel threads is very low for most VMs, but an Edge has network threads that can consume a lot of CPU while handling traffic.

Resolution

Open an SSH session to the Edge VM, log in with admin credentials, and run the following command:

get dataplane cpu stats

Consider scaling the Edge node infrastructure horizontally or vertically if usage on the dataplane is above 80%.


Additional Information

Note that on an X-Large Edge, the last two vCPUs are reserved for encryption, load balancing, and management functions, so their %Used may be very low compared to the other four vCPUs. On Quad Large and smaller form factors, these tasks do not have reserved vCPUs.

This is technically a Linux accounting bug that is relevant to any Linux virtual machine with heavy I/O. Edges are simply more prone to the issue because they carry such heavy network traffic.

The NSX-T article for the same issue is located here:

https://kb.vmware.com/s/article/89624

This is a helpful link to help interpret esxtop outputs:
https://communities.vmware.com/docs/DOC-9279

Impact/Risks:

vCenter and vRNI show alerts about high or 100% CPU utilization on the Edge VM.