This article provides information about using Non-Maskable Interrupt (NMI) facilities to troubleshoot unresponsive VMware ESXi or ESX hosts.
Caution: This process likely causes the ESXi/ESX host to halt with a purple diagnostic screen. If the ESXi/ESX host is responding sufficiently to run virtual machines, triggering a purple diagnostic screen following this process abruptly powers down all virtual machines running on this ESXi/ESX host.
If you see an NMI with unknown origin, see Identifying and addressing Non-Maskable Interrupt events on an ESX/ESXi host (1804).
A Non-Maskable Interrupt (NMI) is a hardware interrupt that cannot be ignored by the processor. These types of interrupts are usually reserved for very important tasks and to report hardware errors to the processor.
Depending on the make and model of the system, you may be able to deliberately send an NMI to the CPUs. Sending an NMI forces the processor to switch CPU context to the registered non-maskable interrupt handler. The interrupt cannot be ignored (masked). The operating system handles the NMI according to its prior configuration.
An intentionally triggered NMI can help to highlight:
In some cases, you may want the ESXi/ESX host to generate a purple diagnostic screen and core dump to further troubleshoot an issue. By default, ESXi/ESX hosts prior to 5.0 only log the NMI and do not halt with a purple diagnostic screen. Starting with ESXi 5.0, the host halts with a purple diagnostic screen by default.
Either the VMkernel or the service console may break out of any continuously looping process on the CPU and log the NMI. As each kernel receives the NMI, it can be configured to respond to an NMI by generating a purple diagnostic screen.
The VMkernel either handles an NMI directly and generates a purple diagnostic screen, or routes the NMI through to the service console. If the NMI is routed to the ESX service console, its Linux kernel can either handle the NMI by triggering an Oops and a purple diagnostic screen, or ignore it.
The options available for additional handling of an NMI vary between versions of VMware ESXi and ESX:
If a purple diagnostic screen is triggered, a coredump from the VMkernel is saved. If the NMI was routed to the service console to trigger the purple diagnostic screen, a coredump from the service console Linux kernel is also saved. The service console coredump may be needed depending on the issue being investigated.
Ensure that the ESXi/ESX host is correctly configured to capture the VMkernel and service console coredumps.
For more information, see:
It is possible for third-party OEM NMI drivers to intentionally initiate halting with a purple diagnostic screen upon receipt of an NMI regardless of the configured option. For more information, see Understanding the message: Panic requested by one or more 3rd party NMI handlers (2005413).
Note: For more information on HP servers running VMware ESX/ESXi 4.1 or higher, see ESX and ESXi installations on HP systems require the HP NMI driver (1021609).
VMware ESXi 4.x / 5.x as well as ESX 4.x have an advanced configuration option that affects the actions taken upon receiving an NMI. By default, the NMI is routed to the service console, which has no effect in ESXi and is ignored by default in ESX.
The VMkernel option Misc.NMILint1IntAction has 3 possible values:
Note: If an ESXi/ESX host is unresponsive very early in the boot process, the VMkernel boot option VMkernel.Boot.nmiAction should be used instead. The default of 0 defers to the Misc.NMILint1IntAction option later in the boot process.
To configure the VMkernel to generate a purple diagnostic screen upon receiving an NMI, set the advanced option Misc.NMILint1IntAction to 2. For more information, see Configuring advanced options for ESXi/ESX (1038578).
Note: You must reboot the ESX/ESXi host for the change to take effect.
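As a minimal sketch, the advanced option could also be set from a command line instead of through the client. The helper below only prints the command it would run and changes nothing. It assumes that the esxcli form applies to ESXi 5.x, that the esxcfg-advcfg form applies to ESX/ESXi 4.x, and that /Misc/NMILint1IntAction is the command-line path for Misc.NMILint1IntAction. The function name nmi_action_cmd is hypothetical.

```shell
#!/bin/sh
# Print (do not run) a command that would set Misc.NMILint1IntAction.
# Arguments: major product version (for example 4 or 5), desired option value.
# Assumption: esxcli is used on 5.x and esxcfg-advcfg on 4.x.
nmi_action_cmd() {
    major="$1"
    value="$2"
    if [ "$major" -ge 5 ]; then
        echo "esxcli system settings advanced set -o /Misc/NMILint1IntAction -i $value"
    else
        echo "esxcfg-advcfg -s $value /Misc/NMILint1IntAction"
    fi
}

# Example: show the command for an ESXi 5.x host and the value 2.
nmi_action_cmd 5 2
```

Review the printed command before running it on the host, and reboot afterwards as noted above.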
VMware ESX 3.x always routes the NMI to the service console, so configuring options in the service console is the only mechanism to trigger a purple diagnostic screen on NMI under ESX 3.x. VMware ESX 4.x can be configured to handle the NMI in the VMkernel or to route it to the service console. This configuration is not applicable to ESXi. For ESXi, use the preceding VMkernel method.
By default, the service console Linux kernel in ESX 3.x and 4.x logs the NMI event but takes no other action. The service console Linux kernel can be configured to handle the NMI by halting with a purple diagnostic screen.
To configure the ESX host service console to halt with a panic on receiving the NMI:
Open the /etc/sysctl.conf file in a text editor. For more information, see Editing configuration files in VMware ESXi and ESX (1017022). This configuration file has a token=value syntax, with one configuration option per line.
Add the applicable pair of options:
kernel.mem_nmi_panic = 1
kernel.unknown_nmi_panic = 1

kernel.panic_on_unrecovered_nmi = 1
kernel.unknown_nmi_panic = 1
Run sysctl -p. The sysctl configuration options are displayed as they are applied. The two new configuration options should be displayed last.
Note: If this configuration must be reverted, edit the configuration file and set both options to 0.
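The edit to the configuration file can also be scripted. The following is a minimal sketch, not a VMware-provided tool: the function name set_sysctl_option is hypothetical, and it assumes a POSIX shell with awk is available in the service console. It removes any existing line for the given option and appends the new value, so running it twice does not create duplicates.

```shell
#!/bin/sh
# Set "key = value" in a sysctl-style file (token=value, one option per line).
# Arguments: file path, option name, value.
set_sysctl_option() {
    file="$1"
    key="$2"
    value="$3"
    tmp="${file}.tmp.$$"
    # Create the file if it does not exist yet.
    [ -f "$file" ] || : > "$file"
    # Copy every line except an existing assignment of this exact option.
    awk -v k="$key" -F= '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' "$file" > "$tmp"
    # Append the new setting.
    echo "${key} = ${value}" >> "$tmp"
    # Write back in place so the original file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (not executed here):
#   set_sysctl_option /etc/sysctl.conf kernel.unknown_nmi_panic 1
#   sysctl -p
```

Calling the function again with a value of 0 reverts the option, which matches the revert procedure in the note above.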
If an ESXi/ESX host was not configured appropriately prior to the outage, the issue must be reproduced before information about the unresponsive state is obtained.
At the time of the next outage, re-check the symptoms described in Determining why an ESX/ESXi host does not respond to user interaction at the console (1017135) to ensure the same symptoms are observed.
If the server is completely unresponsive to keyboard input and network traffic, take a screenshot or photograph of the VMkernel logs. Check whether the VMkernel logs are continuing to scroll on the screen or whether they have frozen. When you have recorded the events, trigger the NMI by pressing the NMI button on the physical server or by using the remote hardware management interface.
At this point, the server displays one of these symptoms:
The NMI button or switch location varies depending on the hardware. A small set of examples is available:
ipmitool -I lan -H <RemoteServerBMCAddress> -U <Username> -a chassis power diag
The preceding links were correct as of November 6, 2013. If you find a link is broken, provide feedback and a VMware employee will update the link.
For information on how to trigger the NMI for a particular server system, consult your hardware vendor.
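For servers with an IPMI-capable management controller, the ipmitool example above can be wrapped in a small helper. This is a sketch only: the function name nmi_ipmi_cmd is hypothetical, it prints the command rather than running it, and whether the lan or lanplus interface is appropriate depends on the management controller, so treat the interface argument as an assumption to verify with the hardware vendor.

```shell
#!/bin/sh
# Print (do not run) an ipmitool command that pulses a diagnostic interrupt.
# Arguments: BMC address, user name, optional interface (default: lan).
nmi_ipmi_cmd() {
    bmc="$1"
    user="$2"
    iface="${3:-lan}"
    if [ -z "$bmc" ] || [ -z "$user" ]; then
        echo "usage: nmi_ipmi_cmd <bmc-address> <username> [interface]" >&2
        return 1
    fi
    # -a makes ipmitool prompt for the password instead of taking it as an argument.
    echo "ipmitool -I $iface -H $bmc -U $user -a chassis power diag"
}

# Example (address and user name are placeholders):
#   nmi_ipmi_cmd 192.0.2.10 admin
```

Printing the command first gives a chance to confirm the target address, because the diagnostic interrupt halts the host if it is configured to panic on NMI.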
Configuring an ESXi/ESX host to capture a VMkernel coredump from a purple diagnostic screen