A non-maskable interrupt (NMI) is a physical hardware event. It typically results from a condition encountered by the system BIOS and/or management chipset that is non-recoverable, in the sense that the system cannot continue operating during that boot cycle.
NMI events
NMI events are routed by the CPU through the Advanced Programmable Interrupt Controller (APIC) to the operating system kernel (in this case, the VMkernel of the ESXi host).
An NMI event occurs due to hardware issues such as:
- A PCI bus error, typically caused by a misbehaving I/O device or an electrical glitch.
- A bad memory module or processor.
- Severe thermal cycling of a critical component, usually after an extended downtime or a cooling component failure.
- Components running out of specification, such as an over-voltage or under-voltage condition due to a hardware fault involving a voltage regulator module.
- Unapproved or incompatible components, such as an active memory backplane whose design revision is too early for the chassis.
- A firmware, BIOS, or other component mismatch. For example, an option card of revision X may require a minimum option-card firmware revision Y and a minimum chassis BIOS revision Z.
- On some systems, the CPU IOMMU feature that is used to map the DMA memory for a device from the host operating system to the guest operating system is configured by firmware to raise an NMI when it encounters an error, instead of allowing the operating system to catch and diagnose the error. IOMMU errors are typically caused by misbehaving I/O device drivers or firmware.
If you experience an NMI event:
- Identify the virtual machines (if any) that were powered on at the time of the NMI event.
- Check whether powering on a specific virtual machine triggers an NMI event.
- Reseat the PCI cards and/or move them to different slots.
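The first two checks come down to correlating virtual machine power state with the time of the NMI. The following is a minimal Python sketch of that correlation, assuming the power-on and power-off times for each virtual machine have already been gathered by hand. The `PowerInterval` record and the `vms_running_at` helper are hypothetical names used for illustration only and are not part of any ESXi interface.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PowerInterval:
    """Hypothetical record: one powered-on period of a virtual machine."""
    vm_name: str
    powered_on: datetime
    powered_off: Optional[datetime] = None  # None means still powered on


def vms_running_at(intervals: List[PowerInterval], nmi_time: datetime) -> List[str]:
    """Return the names of VMs whose powered-on period covers nmi_time."""
    running: List[str] = []
    for interval in intervals:
        started = interval.powered_on <= nmi_time
        not_stopped = interval.powered_off is None or nmi_time < interval.powered_off
        # Report each VM once, even if it has several matching intervals.
        if started and not_stopped and interval.vm_name not in running:
            running.append(interval.vm_name)
    return running
```

If the same virtual machine shows up across several NMI events, that is a reasonable candidate for the power-on check described above.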
To resolve the NMI event, contact the hardware vendor and provide these data:
- The timeframe in which the event occurred.
- At least 10 minutes of logs leading up to the event.
- Chassis diagnostics log output and management chipset log output.
- Chassis vital product data.
- A copy of the vm-support output.
- The relevant VMware Service Request number, if opened.
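To isolate the logs leading up to the event, the window can be cut out of a log file programmatically. The following is a minimal Python sketch, assuming each log entry begins with an ISO 8601 UTC timestamp such as `2024-01-01T11:55:00.123Z`. That format is an assumption about the log being processed, and the function names `parse_timestamp` and `lines_before_event` are hypothetical helpers rather than part of any VMware tool.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


def parse_timestamp(line: str) -> Optional[datetime]:
    """Parse a leading ISO 8601 UTC timestamp; return None if there is none."""
    token = line.split(" ", 1)[0]
    if token.endswith("Z"):
        token = token[:-1]
    # Drop fractional seconds; whole-second precision is enough for a window.
    token = token.split(".", 1)[0]
    try:
        return datetime.strptime(token, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None


def lines_before_event(lines: Iterable[str], event_time: datetime,
                       minutes: int = 10) -> List[str]:
    """Return the lines stamped within `minutes` before event_time (inclusive)."""
    start = event_time - timedelta(minutes=minutes)
    selected: List[str] = []
    keep = False
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is not None:
            keep = start <= stamp <= event_time
        # Lines without a timestamp (continuations) follow the previous entry.
        if keep:
            selected.append(line)
    return selected
```

Increase `minutes` to capture a longer window, since 10 minutes is the minimum requested. Event times are treated as UTC to match the assumed log timestamps.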
Notes:
- Chassis management chipsets often function as an intelligent handler for chassis faults and can capture significant amounts of information during an NMI event.
- The IBM xSeries chassis includes a BIOS option named Reboot on System NMI. When enabled, this option causes an immediate chassis reboot rather than a chassis halt. In that case, the ESXi host logs do not mention the NMI. Other enterprise hardware vendors may offer a similar BIOS option.