ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers

Article ID: 317547


Updated On:

Products

VMware

Issue/Introduction

Symptoms:
  • ESXi hosts running 5.5 p10, 5.5 ep11, 6.0 p04, 6.0 U3, or 6.5 GA may fail with a purple diagnostic screen caused by a non-maskable interrupt (NMI) on HPE ProLiant Gen8 servers. Commands to confirm the build and server model are shown after the backtrace below.
  • Intermittent purple diagnostic screens cite an NMI (non-maskable interrupt) or LINT1 interrupt, with a backtrace similar to:

    2017-04-29T08:12:14.617Z cpu0:33074)@BlueScreen: LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor.
    2017-04-29T08:12:14.617Z cpu0:33074)Code start: 0x41800d200000 VMK uptime: 1:10:11:25.236
    2017-04-29T08:12:14.618Z cpu0:33074)0x4390c991b1b0:[0x41800d2780da]PanicvPanicInt@vmkernel#nover+0x37e stack: 0x4390c991b248
    2017-04-29T08:12:14.618Z cpu0:33074)0x4390c991b240:[0x41800d2783a5]Panic_NoSave@vmkernel#nover+0x4d stack: 0x4390c991b2a0
    2017-04-29T08:12:14.619Z cpu0:33074)0x4390c991b2a0:[0x41800d274373]NMICheckLint1Bottom@vmkernel#nover+0x53 stack: 0x4390c991b370
    2017-04-29T08:12:14.619Z cpu0:33074)0x4390c991b2b0:[0x41800d23307e]BH_DrainAndDisableInterrupts@vmkernel#nover+0xe2 stack: 0x0
    2017-04-29T08:12:14.620Z cpu0:33074)0x4390c991b340:[0x41800d256e22]IDT_IntrHandler@vmkernel#nover+0x1c6 stack: 0x0
    2017-04-29T08:12:14.620Z cpu0:33074)0x4390c991b370:[0x41800d2c8044]gate_entry_@vmkernel#nover+0x0 stack: 0x0
    2017-04-29T08:12:14.620Z cpu0:33074)0x4390c991b430:[0x41800d5048aa]Power_HaltPCPU@vmkernel#nover+0x1ee stack: 0x417fcd483f20
    2017-04-29T08:12:14.621Z cpu0:33074)0x4390c991b480:[0x41800d411c48]CpuSchedIdleLoopInt@vmkernel#nover+0x2f8 stack: 0x117308c314611
    2017-04-29T08:12:14.621Z cpu0:33074)0x4390c991b500:[0x41800d4153a3]CpuSchedDispatch@vmkernel#nover+0x16b3 stack: 0x4394002a7100
    2017-04-29T08:12:14.622Z cpu0:33074)0x4390c991b620:[0x41800d415f68]CpuSchedWait@vmkernel#nover+0x240 stack: 0x0
    2017-04-29T08:12:14.622Z cpu0:33074)0x4390c991b6a0:[0x41800d4162a5]CpuSchedTimedWaitInt@vmkernel#nover+0xc9 stack: 0x2001
    2017-04-29T08:12:14.623Z cpu0:33074)0x4390c991b720:[0x41800d416376]CpuSched_TimedWait@vmkernel#nover+0x36 stack: 0x430337ad30c0
    2017-04-29T08:12:14.623Z cpu0:33074)0x4390c991b740:[0x41800d219228]PageCacheAdjustSize@vmkernel#nover+0x344 stack: 0x0
    2017-04-29T08:12:14.623Z cpu0:33074)0x4390c991bfd0:[0x41800d416bfe]CpuSched_StartWorld@vmkernel#nover+0xa2 stack: 0x0
    2017-04-29T08:12:14.627Z cpu0:33074)base fs=0x0 gs=0x418040000000 Kgs=0x0

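To confirm that a host matches the affected combination, you can check the ESXi build and the server model from an ESXi Shell or SSH session. This is a minimal sketch using standard esxcli commands; compare the output against the releases and hardware listed above.

    # Report the ESXi product version and build number
    esxcli system version get

    # Report the server vendor and model (for example, ProLiant DL380p Gen8)
    esxcli hardware platform get
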

Cause

The issue is triggered by a change introduced in ESXi 5.5 p10, 5.5 ep11, 6.0 p04, 6.0 U3, and 6.5 GA in which ESXi disables the Intel IOMMU's (also known as VT-d) interrupt remapping functionality. On HPE ProLiant Gen8 servers, this change causes PCI errors that lead the platform to generate an NMI, causing the ESXi host to fail with a purple diagnostic screen.
 
HPE has identified the cause of the issue on the HPE ProLiant DL560 Gen8 and HPE ProLiant DL380p Gen8 servers as high-performance, low-latency PCIe adapters installed in slot 3 combined with heavy system load. For more information, see the HPE Customer Advisory.
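To identify which adapters are installed in slot 3, you can list the PCI devices on the host and look at the physical slot each one reports. This is a sketch only; not every device reports a physical slot number, and field names can vary slightly between ESXi releases.

    # List PCI devices and the physical slot each one reports
    esxcli hardware pci list | grep -E "Address|Device Name|Physical Slot"
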
 
Disclaimer: VMware is not responsible for the reliability of any data, opinions, advice, or statements made on third-party websites. Inclusion of such links does not imply that VMware endorses, recommends, or accepts any responsibility for the content of such sites.
 
 
 

Resolution

This is a known issue affecting ESXi 5.5 p10, ESXi 5.5 ep11, ESXi 6.0 p04, ESXi 6.0 U3, and ESXi 6.5 GA on HPE ProLiant Gen8 servers. This information is also available for reference in the HPE advisory.
 
For ESXi 6.5
 
This issue is resolved in ESXi 6.5 Patch Release ESXi650-201703001, available at VMware Patch Downloads. For more information on downloading patches, see How to download patches in Customer Connect (1021623).
 
For ESXi 6.0
 
This issue is resolved in ESXi 6.0 Patch Release ESXi600-201706001, available at VMware Patch Downloads. For more information on downloading patches, see How to download patches in Customer Connect (1021623).
 
For ESXi 5.5
 
This issue is resolved in ESXi 5.5 Patch Release ESXi550-201709401, available at VMware Patch Downloads. For more information on downloading patches, see How to download patches in Customer Connect (1021623).
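If you apply any of the patch releases above as an offline bundle from the ESXi Shell, the sequence is roughly as follows. This is a sketch, not the only supported method: the datastore path and bundle file name are placeholders for the patch you actually downloaded, and applying the patch through vSphere Update Manager works equally well.

    # Enter maintenance mode (migrate or power off virtual machines first)
    esxcli system maintenanceMode set --enable true

    # Apply the downloaded offline bundle (path and file name are placeholders)
    esxcli software vib update -d /vmfs/volumes/datastore1/ESXi650-201703001.zip

    # Reboot to complete the update, then exit maintenance mode
    reboot
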
 
Alternatively, on the HPE ProLiant DL560 Gen8 server or HPE ProLiant DL380p Gen8 server with the IOMMU interrupt remapper disabled, you can resolve this issue by moving the low-latency or high-performance PCIe card to slot 1, 2, 4, 5, or 6 (depending on the type of secondary riser board that is installed).
 
To work around this issue, re-enable the Intel IOMMU interrupt remapper on the ESXi host (a note on later reverting this change follows the steps below):
  1. Connect to the ESXi host with an SSH session and root credentials.
  2. Run this command:

    esxcli system settings kernel set --setting=iovDisableIR -v FALSE
     
  3. Reboot the ESXi host.
  4. Ensure that the iovDisableIR setting is set to FALSE by running this command:

    esxcli system settings kernel list -o iovDisableIR

    For example:

    esxcli system settings kernel list -o iovDisableIR

    Name          Type  Description                                Configured  Runtime  Default
    ------------  ----  -----------------------------------------  ----------  -------  -------
    iovDisableIR  Bool  Disable Interrupt Routing in the IOMMU...  FALSE       FALSE    TRUE
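Note: If you later install one of the patch releases above, the kernel option can be returned to its shipped default of TRUE with the same command. Whether to revert depends on your environment, so treat this only as a sketch of the reversal; a reboot is again required for the change to take effect.

    # Optional: revert the workaround after patching (reboot required to apply)
    esxcli system settings kernel set --setting=iovDisableIR -v TRUE
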


Additional Information

How to download patches in Customer Connect
ESXi host fails with a diagnostic screen due to an Intel Virtualization Technology Erratum
ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers (Simplified Chinese)
ESXi host fails with an intermittent NMI purple screen on HP ProLiant Gen8 servers (Japanese)