Intermittent 100% CPU Usage spikes on hosts with AMD EPYC Zen3 (Milan) CPUs

Article ID: 318366

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

When using ESXi on hosts with AMD Zen3 (7XX3) based CPUs, independent of load and even without running virtual machines, you might notice:

  • 100% CPU Usage spikes on any PCPU at random times when observed in the vSphere Client
  • PCPU spikes of several thousand percent or more in esxtop or esxtop batch data
  • Host CPU Usage averages that over-report significantly as spikes become more frequent
  • No correlating high CPU usage from virtual machines or other worlds at the time of the spikes
  • For running VMs, large amounts of "CPU Latency", consisting mostly of "Overlap"
    • In esxtop, those metrics are called %LAT_C and %OVRLP
    • In vROps, CPU Contention and Overlap respectively
  • CPU Usage for VMs might under-report and even drop below utilization due to "CPU Latency"
  • Cluster level CPU Usage is derived from VM CPU usage and might also under-report
  • When running many highly utilized virtual machines, some performance impact might be seen
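To check whether a host shows these spikes, collected samples can be scanned for PCPU values above 100%. The following is a minimal sketch that assumes the data has been exported to CSV with one column per PCPU. The function name `find_spikes`, the column names, and the sample values are hypothetical; real esxtop batch output uses different column names, so `column_hint` would need to be adjusted.

```python
import csv
import io


def find_spikes(csv_text, column_hint="PCPU", threshold=100.0):
    """Return (row_number, column_name, value) for every sample above threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Only inspect columns whose name contains the hint (assumed naming).
    wanted = [i for i, name in enumerate(header) if column_hint in name]
    spikes = []
    for row_number, row in enumerate(reader, start=1):
        for i in wanted:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue  # skip missing or non-numeric cells
            if value > threshold:
                spikes.append((row_number, header[i], value))
    return spikes


# Made-up sample data for illustration, not real esxtop output.
sample = (
    '"Time","PCPU0 Util","PCPU1 Util"\n'
    '"t1","12.5","8.0"\n'
    '"t2","4321.0","9.1"\n'
)

for spike in find_spikes(sample):
    print(spike)
```

A spike reported this way is only a hint. It matches the issue described here when no virtual machine or other world shows correspondingly high CPU usage at the same time.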


Environment

VMware vSphere ESXi 7.0.x
VMware ESXi 6.7.x

Cause

ESXi uses the performance monitoring counter "Non Halted Core Cycles" (NHCC) for frequency-scaling-aware CPU usage accounting. This counter is read via the RDPMC instruction, which is not guaranteed to return strictly increasing values when executed in quick succession. When two reads return in an unexpected order, the later value is smaller than the earlier one, so the calculated difference "wraps around" and leads to excessive CPU usage accounting. While this issue might be seen on other CPUs, it is more noticeable on AMD Milan due to architectural differences.
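The wrap-around can be illustrated with a small sketch. It assumes the difference between two counter reads is computed in unsigned 64-bit arithmetic; the actual width and accounting logic inside ESXi are not described in this article. The names `unsigned_delta` and `guarded_delta` are hypothetical, and the guard is shown only for contrast, not as the fix implemented in ESXi.

```python
COUNTER_BITS = 64  # assumption: deltas are computed as unsigned 64-bit values
MASK = (1 << COUNTER_BITS) - 1


def unsigned_delta(previous: int, current: int) -> int:
    """Return current - previous as fixed-width unsigned arithmetic would."""
    return (current - previous) & MASK


def guarded_delta(previous: int, current: int) -> int:
    """Same difference, but treat a backwards step as zero elapsed cycles."""
    return current - previous if current >= previous else 0


# Two reads in quick succession that come back in an unexpected order:
# the second read is 10 cycles *lower* than the first.
earlier_read, later_read = 1_000_100, 1_000_090

print(unsigned_delta(earlier_read, later_read))  # wraps to 2**64 - 10
print(guarded_delta(earlier_read, later_read))   # 0
```

A difference of only a few cycles in the wrong direction therefore turns into an enormous cycle count, which is consistent with the spikes of several thousand percent seen in esxtop.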

Note that, apart from its effect on capacity planning and on alerts triggered by the inflated CPU usage, the issue is mostly cosmetic and should not impact operation or performance. However, when running many VMs, especially when vCPUs are overcommitted, fairness might not be ensured: VMs that are usually more entitled might see more contention than VMs that are usually less entitled.

Resolution

This issue is resolved in ESXi 6.7 P06 and ESXi 7.0 U3.


Workaround:

In the unlikely event that this issue is impacting performance, or if you need stable metrics for capacity planning, you can disable NHCC-based CPU usage accounting by setting the advanced kernel (boot) setting "useNHCC" to FALSE. Note that this makes ESXi unaware of run-time frequency scaling and might result in different scheduling behavior for some workloads. There is no scheduling impact on ESXi hosts that always run at a fixed frequency or at their maximum frequency.

On the ESXi CLI this can be done via:

esxcli system settings kernel set -s useNHCC -v FALSE