Determining if virtual machine and ESX host unresponsiveness is caused by hardware issues

Article ID: 343529

Products

VMware vSphere ESXi

Issue/Introduction

Hardware issues with ESX hosts can manifest themselves in several different ways:

Virtual machines become unresponsive at random intervals on a single ESX host
The cause of the unresponsive state is different every time
You see purple screen errors on the ESX host. For more information, see Interpreting an ESX host purple diagnostic screen (1004250)
You see Error Correcting Code (ECC) errors in hardware monitoring logs
You receive Machine Check Exception (MCE) errors. For more information, see Decoding Machine Check Exception (MCE) output after a purple screen error (1005184)
You receive Non-Maskable Interrupt (NMI) errors. For more information, see Identifying and addressing Non-Maskable Interrupt events on an ESX host (1804)
The ESX or ESXi host is hung and unresponsive. For more information, see Using hardware NMI facilities to troubleshoot unresponsive hosts (1014767)

Environment

VMware vSphere ESXi 5.0
VMware ESXi 4.0.x Installable
VMware ESXi 4.1.x Installable
VMware ESXi 4.0.x Embedded
VMware ESX Server 3.0.x
VMware ESXi 3.5.x Embedded
VMware ESX Server 2.0.x
VMware ESX 4.0.x
VMware ESX Server 3.5.x
VMware ESX 4.1.x
VMware ESXi 3.5.x Installable
VMware ESX Server 2.1.x
VMware ESXi 4.1.x Embedded
VMware vSphere ESXi 5.5
VMware vSphere ESXi 5.1
VMware ESX Server 2.5.x

Resolution

A virtual machine may become unresponsive and log an error that references a bug fixed years ago. This occurs because the "regs" value references a known bug number in VMware code. Here is an example of a virtual machine error on an ESX 3.0.2 host:

vmkernel: 2:09:56:43.671 cpu2:1136)WARNING: World: vm 1136: 6012: vmm0:HSCPTOYFAPPS101:vcpu-0:SyncCB failure: 4408b65 (bug #4938)
vmkernel: 2:09:56:43.671 cpu2:1136)World: 6015: vmm group leader = 1136, members = 1
vmkernel: 2:09:56:43.671 cpu2:1136)Backtrace for current CPU #2, worldID=1136, ebp=0x35c3f88
vmkernel: 2:09:56:43.671 cpu2:1136)0x35c3f88:[0x63c1cb]World_VMMPanic+0xa7(0xd115e0, 0x573c8, 0x0, 0x2e90, 0x0)
vmkernel: 2:09:56:43.671 cpu2:1136)0x35c3fb0:[0x63c1cb]World_VMMPanic+0xa7(0x29, 0x2e68, 0x61dd90, 0x35c4000, 0x111c0021)
vmkernel: 2:09:56:43.672 cpu2:1136)0x35c3fe8:[0x61de1a]VMKCall+0x8a(0x29, 0x2e68, 0x1046, 0x2e2c, 0x0)

To see how often backtraces occur and on which CPU, search the vmkernel logs:

$ fgrep -i backtrace /var/log/vmkernel*

vmkernel: 2:09:56:43.671 cpu2:1136)Backtrace for current CPU #2, worldID=1136, ebp=0x35c3f88
vmkernel: 3:06:14:23.172 cpu2:1131)Backtrace for current CPU #2, worldID=1131, ebp=0xc3fc14
vmkernel: 1:20:57:09.892 cpu2:1132)Backtrace for current CPU #2, worldID=1132, ebp=0x35b3f88
vmkernel: 0:00:42:56.240 cpu2:1109)Backtrace for current CPU #2, worldID=1109, ebp=0x3557f88
vmkernel: 1:05:29:16.295 cpu2:1146)Backtrace for current CPU #2, worldID=1146, ebp=0x35ebf88
vmkernel: 2:01:59:23.758 cpu2:1152)Backtrace for current CPU #2, worldID=1152, ebp=0x3603f88
vmkernel: 2:14:22:31.688 cpu2:1139)Backtrace for current CPU #2, worldID=1139, ebp=0x35cff88
vmkernel: 3:03:17:24.927 cpu2:1116)Backtrace for current CPU #2, worldID=1116, ebp=0x3573f88
vmkernel: 3:10:16:51.096 cpu2:1135)Backtrace for current CPU #2, worldID=1135, ebp=0x35bff88
vmkernel: 2:17:31:09.784 cpu2:1112)Backtrace for current CPU #2, worldID=1112, ebp=0x3563f88
vmkernel: 2:17:53:48.630 cpu2:1131)Backtrace for current CPU #2, worldID=1131, ebp=0x35aff88

The above example indicates a potential problem with logical CPU #2, since every backtrace listed was logged on that CPU. However, this might be a coincidence, and the problem could instead be related to the motherboard.
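
The per-CPU tally behind this observation can be scripted. The following is a minimal sketch, not a VMware-provided tool: the function name count_backtraces_by_cpu is hypothetical, and it assumes the log lines follow the "Backtrace for current CPU #N" format shown above and that awk and sort are available where the logs are analyzed.

```shell
# Hypothetical helper: tally "Backtrace for current CPU #N" lines per CPU.
# Reads vmkernel log lines on stdin and prints "<count> CPU #<n>",
# with the most frequently reported CPU first.
count_backtraces_by_cpu() {
    awk '
        # Match case-insensitively, like "fgrep -i backtrace".
        tolower($0) ~ /backtrace for current cpu #[0-9]+/ {
            line = tolower($0)
            sub(/.*backtrace for current cpu #/, "", line)  # drop text before the CPU number
            sub(/[^0-9].*/, "", line)                        # drop text after the CPU number
            count[line]++
        }
        END {
            for (cpu in count) printf "%d CPU #%s\n", count[cpu], cpu
        }
    ' | sort -rn
}

# Example (log path taken from the command above):
#   cat /var/log/vmkernel* | count_backtraces_by_cpu
```

For the eleven backtrace lines listed above, this would print a single row, 11 CPU #2, which matches the pattern described in this section.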

Note: If only a single virtual machine is affected, determine whether the virtual machine was converted from a physical server. Hardware management agents or other applications and drivers left inside the guest could cause a violation. VMware recommends verifying that the guest operating system is supported by ESX and that the guest is supported on that hardware (for example, check for a 64-bit virtual machine running without VT enabled in the BIOS).

You must review the vmkernel logs to determine whether the errors are unique and whether different virtual machines experience them:

vmkernel.1

---

vmkernel: 2:09:56:43.671 cpu2:1136)WARNING: World: vm 1136: 6012: vmm0:HSCPTOYFAPPS101:vcpu-0:SyncCB failure: 4408b65 (bug #4938)
vmkernel: 2:09:56:43.671 cpu2:1136)World: 6015: vmm group leader = 1136, members = 1
vmkernel: 2:09:56:43.671 cpu2:1136)Backtrace for current CPU #2, worldID=1136, ebp=0x35c3f88

vmkernel: 3:06:14:23.172 cpu2:1131)WARNING: World: vm 1131: 6012: vmm0:HSCPKIMCAPWF102:vmk: vcpu-0:VMM DoubleFault @ 0x46793 (0x5fc4, 0x5fcc)
vmkernel: 3:06:14:23.172 cpu2:1131)World: 6015: vmm group leader = 1131, members = 1
vmkernel: 3:06:14:23.172 cpu2:1131)Backtrace for current CPU #2, worldID=1131, ebp=0xc3fc14

vmkernel.2

---

vmkernel: 1:20:57:09.892 cpu2:1132)WARNING: World: vm 1132: 6012: vmm0:HSCTKIMCAPIE302:vcpu-0:VMM fault: regs=0x2e48, exc=14, eip=0x3cca8
vmkernel: 1:20:57:09.892 cpu2:1132)World: 6015: vmm group leader = 1132, members = 1
vmkernel: 1:20:57:09.892 cpu2:1132)Backtrace for current CPU #2, worldID=1132, ebp=0x35b3f88

vmkernel.3

---

vmkernel: 0:00:42:56.240 cpu2:1109)WARNING: World: vm 1109: 6012: vmm0:HSCDKIMCAPWS207:vcpu-0:VMM fault: regs=0x2704, exc=14, eip=0x309d0
vmkernel: 0:00:42:56.240 cpu2:1109)World: 6015: vmm group leader = 1109, members = 1
vmkernel: 0:00:42:56.240 cpu2:1109)Backtrace for current CPU #2, worldID=1109, ebp=0x3557f88

vmkernel: 1:05:29:16.295 cpu2:1146)WARNING: World: vm 1146: 6012: vmm0:HSCDKIMCDBSQ201:vcpu-0:VMM fault: regs=0x2f94, exc=13, eip=0x6acc1
vmkernel: 1:05:29:16.295 cpu2:1146)World: 6015: vmm group leader = 1146, members = 2
vmkernel: 1:05:29:16.295 cpu2:1146)Backtrace for current CPU #2, worldID=1146, ebp=0x35ebf88

The above logs indicate that the errors are unique and that different virtual machines experience them. This behavior, coupled with the backtraces occurring on CPU #2, indicates a global cause such as a faulty processor or motherboard.
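
To make this comparison easier across several log files, the WARNING lines can be reduced to one row per event. The following is a minimal sketch under stated assumptions: the function name summarize_vmm_warnings is hypothetical, and the parsing assumes the "WARNING: World: vm ..." line layout shown in the excerpts above (a cpuN: prefix, a vmmN:<VM name>: field, and a vcpu-N: field before the message).

```shell
# Hypothetical helper: reduce vmkernel "WARNING: World: vm ..." lines to
# tab-separated rows of CPU, virtual machine name, and error message.
# Reads log lines on stdin; lines that do not fit the layout are skipped.
summarize_vmm_warnings() {
    awk '
        /WARNING: World: vm / {
            # CPU that logged the event, taken from the "cpuN:" prefix.
            if (!match($0, /cpu[0-9]+:/)) next
            cpu = substr($0, RSTART + 3, RLENGTH - 4)

            # Virtual machine name, taken from "vmmN:<name>:".
            if (!match($0, /vmm[0-9]+:[^:]+:/)) next
            vm = substr($0, RSTART, RLENGTH - 1)
            sub(/^vmm[0-9]+:/, "", vm)

            # Error message: everything after "vcpu-N:".
            if (!match($0, /vcpu-[0-9]+:/)) next
            msg = substr($0, RSTART + RLENGTH)

            printf "cpu%s\t%s\t%s\n", cpu, vm, msg
        }
    '
}

# Example (log path taken from the fgrep command earlier in this article):
#   cat /var/log/vmkernel* | summarize_vmm_warnings
```

If the second column lists several different virtual machines and the third column shows differing errors while the first column stays the same, the output reflects the pattern described above.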

The server in the example above also experienced a purple screen error a day after each of the errors. The purple screen contained content similar to:

VMware ESX Server [Releasebuild-52797]
Exception type 14 in world 1091:vmm0:HSCTKIM @ 0x64dc24
gate=0xe frame=0x350fdac eip=0x64dc24 cr2=0x401e8fb5 cr3=0xe4da1000 cr4=0x669
eax=0x405e8fb3 ebx=0x1 ecx=0x405e8000 edx=0x405e8000 es=0x4041 ds=0x4041
fs=0x4048 gs=0x4041 ebp=0x350fe28 esi=0x860bba91 edi=0x92cc7687 err=0 ef=0x11002
cpu 0 1024 console: cpu 1 1111 vmm0:HSCD: CPU 2 1091 vmm0:HSCT: cpu 3 1093 mks:HSCTK:
cpu 4 1098 mks:HSCTK: cpu 5 1116 vmm0:HSCD: cpu 6 1121 vmm0:HSCD: cpu 7 1123 mks:HSCDK:
@BlueScreen: Exception type 14 in world 1091:vmm0:HSCTKIM @ 0x64dc24
0x350fe28:[0x64dc24]PShareHashTableWalk+0x58(0x1479b20, 0x0, 0x1)
0x350fe64:[0x64def6]PShareAddPage+0x6a(0x1479b20, 0x77fe12, 0x860bba91)
0x350fea0:[0x64ee56]PShare_AddIfShared+0x96(0x860bba91, 0x92cc7687, 0x77fe12)
0x350ff58:[0x607588]AllocCOWSharePage+0x354(0xcd66fc, 0x656f, 0x407ca710)
0x350ff84:[0x60832b]AllocCOWSharePages+0x4b(0xcd66fc, 0xc, 0x407ca000)
0x350ffb0:[0x608424]Alloc_COWSharePages+0x94(0xc, 0x2d58, 0x61dd90)
0x350ffe8:[0x61de1a]VMKCall+0x8a(0xc, 0x2d58, 0x1246)

The backtrace here does not match any backtraces for known issues. Reading the backtrace frame by frame shows that the system dumped in PShareHashTableWalk, while walking the page-sharing hash table, after being called from the routines that add and share pages.
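
When comparing a backtrace against other purple screens or known issues, it can help to reduce it to its function names. The following is a minimal sketch, assuming frames use the address:[address]Function+offset(...) layout shown above; the function name list_backtrace_functions is hypothetical and is not part of any ESX tooling.

```shell
# Hypothetical helper: print the function name from each backtrace frame,
# one per line, in the order the frames appear (innermost call first).
# Reads purple screen text or vmkernel log lines on stdin; lines that are
# not frames (registers, CPU lists, messages) produce no output.
list_backtrace_functions() {
    sed -n 's/.*0x[0-9a-fA-F]*:\[0x[0-9a-fA-F]*]\([A-Za-z_][A-Za-z0-9_]*\)+0x.*/\1/p'
}

# Example: save the purple screen text to a file, then run
#   list_backtrace_functions < purple-screen.txt
```

The file name purple-screen.txt is a placeholder. For the purple screen above, the first name printed would be PShareHashTableWalk and the last would be VMKCall.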

The content of the purple screen error shows the potential for a hardware failure. Have the hardware inspected and, if a fault is found, replaced. If the OEM hardware diagnostics CD provided with the server does not reveal a hardware problem, contact your OEM vendor directly.

The analysis in this article indicates that this issue is not caused by a problem with ESX code.

Note: Multiple purple screens that reflect different CPUs in each instance still indicate a hardware problem with either the CPUs or the motherboard. Run OEM Hardware Diagnostics and contact your OEM vendor for assistance. For more information, see Virtual machine and ESX/ESXi host outage pattern analysis across physical CPUs (2003929).
