Using hardware NMI facilities to troubleshoot unresponsive hosts

Article ID: 344096

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information about using Non-Maskable Interrupt (NMI) facilities to troubleshoot unresponsive VMware ESXi or ESX hosts.

Follow this process if an ESXi/ESX host does not respond to interaction at the console or through the network, and none of the hosted virtual machines respond to remote communication. For more information on these scenarios, see Determining why an ESX/ESXi host does not respond to user interaction at the console (1017135).

If an ESXi/ESX host is responsive at the console, but not manageable remotely, see Troubleshooting an unresponsive host and multiple Disconnected virtual machines (1019082).

Caution: This process likely causes the ESXi/ESX host to halt with a purple diagnostic screen. If the ESXi/ESX host is responding sufficiently to run virtual machines, triggering a purple diagnostic screen following this process abruptly powers down all virtual machines running on this ESXi/ESX host.

If you see an NMI with unknown origin, see Identifying and addressing Non-Maskable Interrupt events on an ESX/ESXi host (1804).

Environment

VMware ESXi 3.5.x Installable
VMware ESXi 4.0.x Installable
VMware ESX Server 3.5.x
VMware vSphere ESXi 6.0
VMware vSphere ESXi 5.0
VMware vSphere ESXi 6.7
VMware ESX Server 3.0.x
VMware ESXi 4.1.x Embedded
VMware vSphere ESXi 6.5
VMware ESXi 4.0.x Embedded
VMware ESX 4.0.x
VMware ESXi 3.5.x Embedded
VMware vSphere ESXi 7.0.0
VMware vSphere ESXi 5.5
VMware ESX 4.1.x
VMware ESXi 4.1.x Installable
VMware vSphere ESXi 5.1

Resolution

NMI Overview

A Non-Maskable Interrupt (NMI) is a hardware interrupt that cannot be ignored by the processor. These types of interrupts are usually reserved for very important tasks and to report hardware errors to the processor.

Depending on the make and model of the system, you may be able to deliberately send an NMI to the CPUs. Sending an NMI forces the processor to switch CPU context to the registered non-maskable interrupt handler. The interrupt cannot be ignored (masked). The operating system then handles the NMI based on its prior configuration.

An intentionally triggered NMI can help to highlight:

- Whether a CPU is capable of servicing interrupts.
- Whether an operating system process or task is continuously looping on the CPU.

Note: Some servers have a BIOS or BMC option to automatically reboot the system whenever a Non-Maskable Interrupt occurs. If such a reboot occurs it implies that the hardware is operating correctly but does not provide enough information to troubleshoot the root cause of the issue. Disable the option.

NMIs and VMware ESXi/ESX

In some cases, you may want the ESXi/ESX host to generate a purple diagnostic screen and core dump to further troubleshoot an issue. By default, ESXi/ESX hosts prior to 5.0 only log the NMI and do not halt with a purple diagnostic screen. Starting with ESXi 5.0, the host halts with a purple diagnostic screen by default.

Either the VMkernel or the service console may break out of any continuously looping process on the CPU and log the NMI. As each kernel receives the NMI, it can be configured to respond to an NMI by generating a purple diagnostic screen.

The VMkernel either handles an NMI directly and generates a purple diagnostic screen, or routes the NMI through to the service console. If the NMI is routed to the ESX service console, its Linux kernel can either handle it by triggering an Oops and a purple diagnostic screen, or ignore it.

The options available for additional handling of an NMI vary between versions of VMware ESXi and ESX:

- ESXi 5.5: The VMkernel can be configured to take no action or to handle the NMI by halting with a purple diagnostic screen. Defaults to halting with a purple diagnostic screen.
- ESXi 5.1: The VMkernel can be configured to take no action or to handle the NMI by halting with a purple diagnostic screen. Defaults to halting with a purple diagnostic screen.
- ESXi 5.0: The VMkernel always handles the NMI by halting with a purple diagnostic screen. No configuration is needed.
- ESXi 4.x: The VMkernel can be configured to take no action or to handle the NMI directly by halting with a purple diagnostic screen. Defaults to taking no action.
- ESX 4.x: The VMkernel can be configured to route the NMI to the service console or handle directly with a purple diagnostic screen. By default, the service console Linux kernel takes no action but can be configured to halt with a purple diagnostic screen.
- ESX 3.x: The VMkernel always routes the NMI to the service console. By default, the service console Linux kernel takes no action but can be configured to halt with a purple diagnostic screen.
- ESXi 3.5: The VMkernel always takes no action on the NMI.
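
As a quick reference, the defaults in the list above can be written as a small lookup. This is only an illustrative sketch: the function name nmi_default_action and the labels it prints (panic, none, service-console) are made up for this example and are not part of any VMware tool.

```shell
# Sketch: print the default NMI handling for a given release,
# following the list above. Names and labels are hypothetical.
nmi_default_action() {
    case "$1" in
        "ESXi 5.5"|"ESXi 5.1"|"ESXi 5.0")
            echo "panic" ;;            # halts with a purple diagnostic screen
        "ESXi 4.x"|"ESXi 3.5")
            echo "none" ;;             # no action is taken by default
        "ESX 4.x"|"ESX 3.x")
            echo "service-console" ;;  # routed to the service console, which ignores it by default
        *)
            echo "unknown"
            return 1 ;;
    esac
}
```

For example, nmi_default_action "ESXi 4.x" prints none, which indicates that the host needs to be reconfigured before an NMI produces a core dump.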

If a purple diagnostic screen is triggered, a coredump from the VMkernel is saved. If the NMI was routed to the service console to trigger the purple diagnostic screen, a coredump from the service console Linux kernel is also saved. The service console coredump may be needed depending on the issue being investigated.

Ensure that the ESXi/ESX host is correctly configured to capture the VMkernel and service console coredumps.

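For more information, see Configuring an ESXi/ESX host to capture a VMkernel coredump from a purple diagnostic screen and Configuring an ESX host to capture a Service Console coredump.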

It is possible for third-party OEM NMI drivers to intentionally initiate halting with a purple diagnostic screen upon receipt of an NMI regardless of the configured option. For more information, see Understanding the message: Panic requested by one or more 3rd party NMI handlers (2005413).

Note: For more information on HP servers running VMware ESX/ESXi 4.1 or higher, see ESX and ESXi installations on HP systems require the HP NMI driver (1021609).

Configuring the ESX/ESXi VMkernel to generate a purple diagnostic screen on NMI

VMware ESXi 4.x / 5.x as well as ESX 4.x have an advanced configuration option that affects the actions taken upon receiving an NMI. On 4.x releases, the NMI is routed to the service console by default, which has no effect in ESXi and is ignored by default in ESX.

The VMkernel option Misc.NMILint1IntAction has 3 possible values:

Enter debugger on hardware NMI.
Panic on hardware NMI, halting the VMkernel with a purple diagnostic screen.
On ESX, pass NMI to Service Console – see the Configuring the Service Console section. On ESXi, do nothing.

Note: If an ESXi/ESX host is unresponsive very early in the boot process, the VMkernel boot option VMkernel.Boot.nmiAction should be utilized instead. The default of 0 defers to the Misc.NMILint1IntAction option later in the boot process.

To configure the VMkernel to generate a purple diagnostic screen upon receiving an NMI, set the advanced option Misc.NMILint1IntAction to 2. For more information, see Configuring advanced options for ESXi/ESX (1038578).

Note: You must reboot the ESX/ESXi host for the change to take effect.
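
One way to apply this setting from a host shell is sketched below. It assumes the esxcfg-advcfg utility is available on the host and that the option is exposed at the path /Misc/NMILint1IntAction; verify both on your release before relying on it. The wrapper name set_nmi_action and the DRY_RUN variable are hypothetical conveniences for this example.

```shell
# Sketch: set Misc.NMILint1IntAction to the given value.
# With DRY_RUN=1 the command is printed instead of executed.
# set_nmi_action and DRY_RUN are hypothetical names.
set_nmi_action() {
    value="$1"
    # Accept only a non-negative integer.
    case "$value" in
        ''|*[!0-9]*)
            echo "value must be a non-negative integer" >&2
            return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "esxcfg-advcfg -s $value /Misc/NMILint1IntAction"
    else
        # Assumed option path; confirm it on your ESXi/ESX version.
        esxcfg-advcfg -s "$value" /Misc/NMILint1IntAction
    fi
}
```

Running set_nmi_action 2 with DRY_RUN=1 first shows the exact command before anything changes on the host. The reboot requirement noted above still applies.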

Configuring the ESX 3.x or 4.x Service Console to generate a purple diagnostic screen on NMI

VMware ESX 3.x always routes the NMI to the service console and configuring options in the service console is the only mechanism to trigger a purple diagnostic screen on NMI under ESX 3.x. VMware ESX 4.x can be configured to handle the NMI in the VMkernel or to route it to the service console. This configuration is not applicable to ESXi. For ESXi, use the preceding VMkernel method.

By default, the service console Linux kernel in ESX 3.x and 4.x logs the NMI event but takes no other action. The service console Linux kernel can be configured to handle the NMI by halting with a purple diagnostic screen.

To configure the ESX host service console to halt with a panic on receiving the NMI:

1. Open a console to the ESX host. For more information, see Unable to connect to an ESX host using Secure Shell (SSH) (1003807).
2. Open the /etc/sysctl.conf file in a text editor. For more information, see Editing configuration files in VMware ESXi and ESX (1017022). This configuration file uses a token=value syntax, with one configuration option per line.
3. Append entries for two configuration options at the bottom of the configuration file. The option names depend on the ESX host version:
- ESX 3.x:
  
  kernel.mem_nmi_panic = 1
  kernel.unknown_nmi_panic = 1
- ESX 4.x:
  
  kernel.panic_on_unrecovered_nmi = 1
  kernel.unknown_nmi_panic = 1
4. Save the configuration file.
5. Reload the configuration file using the command:

sysctl -p

Note: The configuration change takes effect immediately. All defined sysctl configuration options are displayed as they are applied. The two new configuration options should be displayed last.

Note: If this configuration must be reverted, edit the configuration file and set both options to 0.
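
The manual edit described above can also be scripted. The following is a minimal sketch, assuming a POSIX shell with awk on the service console. The helper name set_sysctl_option is hypothetical. It replaces any existing entry for the option and appends the new one at the bottom of the file, so running it twice does not create duplicate lines.

```shell
# Sketch: set "key = value" in a sysctl-style configuration file.
# Any existing line for the same key is dropped, then the new entry
# is appended at the bottom. set_sysctl_option is a hypothetical helper.
set_sysctl_option() {
    file="$1"; key="$2"; value="$3"
    tmp="${file}.tmp.$$"
    # Keep every line whose option name (text before "=") differs from key.
    awk -v k="$key" -F= '{
        name = $1
        gsub(/[ \t]/, "", name)
        if (name != k) print
    }' "$file" > "$tmp"
    echo "${key} = ${value}" >> "$tmp"
    # Write back in place so the original file's permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example for ESX 4.x (back up the file first, then reload):
#   set_sysctl_option /etc/sysctl.conf kernel.panic_on_unrecovered_nmi 1
#   set_sysctl_option /etc/sysctl.conf kernel.unknown_nmi_panic 1
#   sysctl -p
```

Calling the same helper with a value of 0 for both options reverts the change.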

Preparing to reproduce the issue

If an ESXi/ESX host was not configured appropriately prior to the outage, the issue must be reproduced before information about the unresponsive state can be obtained. Before the next occurrence:

- Collect performance data leading up to the outage. For more information, see Using performance collection tools to gather data for fault analysis (1006797).
- Record logs externally through the serial port leading up to the outage. For more information, see Enabling serial-line logging for an ESXi/ESXi host (1003900).
- Press Alt+F12 on the console to display the VMkernel logs on the screen. Leave these logs scrolling; they may be useful if the keyboard becomes unresponsive when the outage recurs.
- Find out how to send an NMI on your particular server hardware. For examples, see the Additional Information section.

Results and next steps

At the time of the next outage, re-check the symptoms described in Determining why an ESX/ESXi host does not respond to user interaction at the console (1017135) to ensure the same symptoms are observed.

If the server is completely unresponsive to keyboard input and network traffic, take a screenshot or photograph of the VMkernel logs. Check whether the VMkernel logs are continuing to scroll on the screen or whether they have frozen. When you have recorded the events, trigger the NMI using the button on the physical server or through the remote hardware management interface.

At this point, the server displays one of these symptoms:

The VMware ESXi/ESX host continues to be unresponsive and nothing is logged.

The hardware is completely unresponsive and does not react in any way to the NMI despite configuring the operating system software to respond accordingly. Engage the hardware vendor and consider using vendor-suggested hardware diagnostic software to run intensive hardware diagnostics for a prolonged period of time. If the hardware vendor does not suggest software, consider using the open-source MemTest86+.

Note: The preceding link was correct as of Sep 23, 2015. If you find the link is broken, provide feedback and a VMware employee will update the link.
The VMware ESXi/ESX host abruptly reboots.

The hardware was able to service the interrupt, but may have initiated the restart itself. Some servers have a BIOS option to automatically reboot the system whenever a Non-Maskable Interrupt occurs. This implies the hardware may be operating correctly but does not provide enough information to proceed. Disable the BIOS option and repeat the test.
The VMware ESXi/ESX host logs NMI-related events but becomes unresponsive again.

The hardware is responsive and the ESXi/ESX kernel was capable of handling the interrupt and logging the event. This usually indicates a software issue as the cause. Although unlikely, this may occur if a driver or other process was stuck in an operational instruction loop. Review the VMkernel logs for Lint N or NMI events and any logs leading up to the outage. If the specific error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi/ESX host and file a Support Request. For more information, see Collecting diagnostic information for VMware products (1008524) and How to File a Support Request.
The VMware ESXi/ESX host logs NMI-related events and becomes responsive.

The hardware is responsive and the ESXi/ESX kernel was capable of handling the interrupt. This usually indicates a software issue as the cause. Although unlikely, this may occur if a driver or other process was stuck in an operational instruction loop. Review the VMkernel logs for Lint N or NMI events and any logs leading up to the outage. If the specific error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi/ESX host and file a Support Request. For more information, see Collecting diagnostic information for VMware products (1008524) and How to File a Support Request.
The VMware ESXi/ESX host displays a purple diagnostic screen on the console.

The hardware is responsive and the ESXi/ESX kernel was capable of handling the interrupt. This usually indicates a software issue as the cause. When the purple diagnostic screen displays Disk dump successful towards the bottom of its output, take a screenshot or photograph and reboot the host. If the error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi/ESX host including the core dump files, and submit a Support Request.
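
The log review mentioned in the outcomes above can be started with a simple search once the host is back up. This is a rough sketch: find_nmi_events is a hypothetical helper, the pattern is deliberately broad and may return unrelated lines, and the log location depends on the release (for example, /var/log/vmkernel.log on ESXi 5.x), so confirm the path for your version.

```shell
# Sketch: print NMI- and LINT-related lines from a log file,
# prefixed with their line numbers. Case-insensitive and broad by design.
# find_nmi_events is a hypothetical helper.
find_nmi_events() {
    logfile="$1"
    grep -n -i -E 'nmi|lint[ ]?[0-9]' "$logfile"
}

# Example (path is an assumption, adjust for your release):
#   find_nmi_events /var/log/vmkernel.log
```

The line numbers in the output make it easier to read the entries immediately before each match, which often show what the host was doing leading up to the outage.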

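For more information, see Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen.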

Additional Information

Triggering the NMI

The location of the NMI button or switch varies depending on the hardware. A small set of examples follows:

IBM x3650 M2 – The NMI button is on the diagnostic panel. There may also be a Send NMI button in the RSA. For more information, see the x3650 M2 Installation and Users Guide.
HP ProLiant – The NMI button or jumper is on the motherboard. There is also a Send NMI button in the iLO. For more information, see Performing an HP ProLiant server NMI crash dump.
Dell R900 – The NMI button is on the front panel. For more information, see the R900 Systems Hardware Owner's Manual.
Fujitsu PRIMERGY Servers (RX/TX) - The NMI button is on the front of the server. For more information, see the Operating Manual for your PRIMERGY Servers. The manual can be found at the Fujitsu website.
1. Click [Industry standard servers] - [PRIMERGY Servers]
2. Select your PRIMERGY Servers from the pulldown menu. For example, [PRIMERGY RX Servers] - [PRIMERGY RX300 Series] - [PRIMERGY RX300 S7]
3. Download the Operating Manual and check for the NMI button location.
Cisco UCS – The NMI can be sent via IPMI or the UCS Manager command-line interface:
- IPMI command – ipmitool -I lan -H <RemoteServerBMCAddress> -U <Username> -a chassis power diag
- UCSM command – diagnostic-interrupt.
  
  For more information, see the Cisco UCS command-line reference documentation for the diagnostic-interrupt command.
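
The IPMI command shown above can be wrapped so that it is previewed before it is sent, since it is expected to halt the target host. This is a sketch only: send_nmi_ipmi and DRY_RUN are hypothetical names, and whether the lan interface is the right one depends on the BMC.

```shell
# Sketch: send a diagnostic interrupt (NMI) through IPMI.
# With DRY_RUN=1 the command is printed instead of executed.
# send_nmi_ipmi and DRY_RUN are hypothetical names.
send_nmi_ipmi() {
    bmc="$1"; user="$2"
    if [ -z "$bmc" ] || [ -z "$user" ]; then
        echo "usage: send_nmi_ipmi <bmc-address> <username>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ipmitool -I lan -H $bmc -U $user -a chassis power diag"
    else
        # -a makes ipmitool prompt for the password interactively.
        ipmitool -I lan -H "$bmc" -U "$user" -a chassis power diag
    fi
}
```

Running it once with DRY_RUN=1 confirms that the address and user name are the intended ones before the interrupt is actually sent.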

The preceding links were correct as of November 6, 2013. If you find a link is broken, provide feedback and a VMware employee will update the link.

For information on how to trigger the NMI for a particular server system, consult your hardware vendor.

Configuring an ESXi/ESX host to capture a VMkernel coredump from a purple diagnostic screen
Unable to connect to an ESX host using Secure Shell (SSH)
Enabling serial-line logging for an ESXi/ESXi host
Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen
Using performance collection tools to gather data for fault analysis
Collecting diagnostic information for VMware products
Editing configuration files in VMware ESXi and ESX
Determining why an ESX/ESXi host does not respond to user interaction at the console
ESX/ESXi hosts do not respond and is grayed out
ESX and ESXi installations on HP systems require the HP NMI driver
Configuring an ESX host to capture a Service Console coredump
Configuring advanced options for ESXi/ESX
"LINT1 motherboard interrupt" error in an ESX/ESXi host
Understanding the message: Panic requested by one or more 3rd party NMI handlers
