ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers (2149043)

Symptoms

  • ESXi hosts running 5.5 p10, 6.0 p04, 6.0 U3, or 6.5 GA may fail with a purple diagnostic screen caused by non-maskable interrupts (NMI) on HPE ProLiant Gen8 servers.
  • The purple diagnostic screens are intermittent and cite an NMI, Non-Maskable, or LINT1 interrupt, similar to:

    2017-04-29T08:12:14.617Z cpu0:33074)@BlueScreen: LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor.
    2017-04-29T08:12:14.617Z cpu0:33074)Code start: 0x41800d200000 VMK uptime: 1:10:11:25.236
    2017-04-29T08:12:14.618Z cpu0:33074)0x4390c991b1b0:[0x41800d2780da]PanicvPanicInt@vmkernel#nover+0x37e stack: 0x4390c991b248
    2017-04-29T08:12:14.618Z cpu0:33074)0x4390c991b240:[0x41800d2783a5]Panic_NoSave@vmkernel#nover+0x4d stack: 0x4390c991b2a0
    2017-04-29T08:12:14.619Z cpu0:33074)0x4390c991b2a0:[0x41800d274373]NMICheckLint1Bottom@vmkernel#nover+0x53 stack: 0x4390c991b370
    2017-04-29T08:12:14.619Z cpu0:33074)0x4390c991b2b0:[0x41800d23307e]BH_DrainAndDisableInterrupts@vmkernel#nover+0xe2 stack: 0x0
    2017-04-29T08:12:14.620Z cpu0:33074)0x4390c991b340:[0x41800d256e22]IDT_IntrHandler@vmkernel#nover+0x1c6 stack: 0x0
    2017-04-29T08:12:14.620Z cpu0:33074)0x4390c991b370:[0x41800d2c8044]gate_entry_@vmkernel#nover+0x0 stack: 0x0
    2017-04-29T08:12:14.620Z cpu0:33074)0x4390c991b430:[0x41800d5048aa]Power_HaltPCPU@vmkernel#nover+0x1ee stack: 0x417fcd483f20
    2017-04-29T08:12:14.621Z cpu0:33074)0x4390c991b480:[0x41800d411c48]CpuSchedIdleLoopInt@vmkernel#nover+0x2f8 stack: 0x117308c314611
    2017-04-29T08:12:14.621Z cpu0:33074)0x4390c991b500:[0x41800d4153a3]CpuSchedDispatch@vmkernel#nover+0x16b3 stack: 0x4394002a7100
    2017-04-29T08:12:14.622Z cpu0:33074)0x4390c991b620:[0x41800d415f68]CpuSchedWait@vmkernel#nover+0x240 stack: 0x0
    2017-04-29T08:12:14.622Z cpu0:33074)0x4390c991b6a0:[0x41800d4162a5]CpuSchedTimedWaitInt@vmkernel#nover+0xc9 stack: 0x2001
    2017-04-29T08:12:14.623Z cpu0:33074)0x4390c991b720:[0x41800d416376]CpuSched_TimedWait@vmkernel#nover+0x36 stack: 0x430337ad30c0
    2017-04-29T08:12:14.623Z cpu0:33074)0x4390c991b740:[0x41800d219228]PageCacheAdjustSize@vmkernel#nover+0x344 stack: 0x0
    2017-04-29T08:12:14.623Z cpu0:33074)0x4390c991bfd0:[0x41800d416bfe]CpuSched_StartWorld@vmkernel#nover+0xa2 stack: 0x0
    2017-04-29T08:12:14.627Z cpu0:33074)base fs=0x0 gs=0x418040000000 Kgs=0x0
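
To check whether a saved log contains this signature, a minimal shell sketch follows. The function name find_nmi_psod is hypothetical, the log file is whichever path you pass in (this article does not name one), and the search pattern is an assumption derived from the sample above.

```shell
# Hypothetical helper: print any panic lines that cite an NMI/LINT1 interrupt.
# $1 is the path to a log file; returns 0 if at least one line matches.
find_nmi_psod() {
    # Case-insensitive match on the @BlueScreen marker followed by
    # LINT1, NMI, or "nonmaskable"/"non-maskable".
    grep -iE '@BlueScreen:.*(LINT1|NMI|non-?maskable)' "$1"
}
```

A match only confirms that the panic message resembles the one shown here; it does not by itself confirm the cause.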

Cause

The issue is triggered by a change in ESXi 5.5 p10, 6.0 p04, 6.0 U3, and 6.5 GA in which ESXi disables the interrupt remapper functionality of the Intel® IOMMU (also known as VT-d). On HPE ProLiant Gen8 servers, this change causes PCI errors, which result in the platform generating an NMI and the ESXi host failing with a purple diagnostic screen.

HPE has identified the cause of the issue on the HPE ProLiant DL560 Gen8 server and HPE ProLiant DL380p Gen8 server as high-performing, low-latency PCIe adapters installed in slot 3, combined with heavy system load. For more information, see the HPE Customer Advisory.

Resolution

This is a known issue affecting ESXi 5.5 p10, ESXi 6.0 p04, ESXi 6.0 U3, and ESXi 6.5 GA on HPE ProLiant Gen8 servers. This information is also available in the HPE advisory.
 
For ESXi 6.5
 
This issue is resolved in ESXi 6.5 Patch Release ESXi650-201703001, available at VMware Patch Downloads. For more information on downloading patches, see How to download patches in MyVMware (1021623).
 
For ESXi 6.0
 
This issue is resolved in ESXi 6.0 Patch Release ESXi600-201706001, available at VMware Patch Downloads. For more information on downloading patches, see How to download patches in MyVMware (1021623).
 
Alternatively, on an HPE ProLiant DL560 Gen8 server or HPE ProLiant DL380p Gen8 server with the IOMMU interrupt remapper disabled, move the low-latency or high-performing PCIe card to slot 1, 2, 4, 5, or 6 (depending on the type of secondary riser board that might be installed).

To work around this issue, re-enable the Intel® IOMMU interrupt remapper on the ESXi host:
  1. Connect to the ESXi host with an SSH session and root credentials.

  2. Run this command:

    esxcli system settings kernel set --setting=iovDisableIR -v FALSE

  3. Reboot the ESXi host.

  4. Ensure that the iovDisableIR setting is set to FALSE by running this command:

    esxcli system settings kernel list -o iovDisableIR

    For example:

    esxcli system settings kernel list -o iovDisableIR

    Name          Type  Description                                 Configured  Runtime  Default
    ------------  ----  ---------------------------------------     ----------  -------  -------
    iovDisableIR  Bool  Disable Interrupt Routing in the IOMMU...   FALSE       FALSE    TRUE
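
The verification in step 4 can also be scripted. The following is a minimal sketch, not a VMware-provided tool: the function name ir_remapper_enabled is hypothetical, and it assumes the command output keeps the column layout shown in the example above, with Configured, Runtime, and Default as the last three columns.

```shell
# Hypothetical helper: reads the output of
#   esxcli system settings kernel list -o iovDisableIR
# on stdin. Returns 0 only if both Configured and Runtime are FALSE,
# meaning the interrupt remapper is re-enabled and the reboot has taken effect.
ir_remapper_enabled() {
    awk '
        # The description contains spaces, so count columns from the right.
        $1 == "iovDisableIR" && NF >= 6 {
            found = 1
            configured = $(NF - 2)
            runtime = $(NF - 1)
        }
        END {
            if (found && configured == "FALSE" && runtime == "FALSE") exit 0
            exit 1
        }
    '
}
```

On the host, pipe the output of the list command from step 4 into ir_remapper_enabled and check the exit status. A non-zero status with Configured set to FALSE and Runtime still TRUE would suggest the host has not yet been rebooted.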
            
