Decoding Machine Check Exception (MCE) output after a purple screen error
Symptoms
- A purple screen fault is encountered. The type of fault is Machine Check Exception (MCE).
- The purple screen shows the following message:
  Machine Check Exception: Unable to continue
- When extracting the logs from the core dump you see messages similar to:
  ALERT: MCE: 171: Machine Check Exception: Bank 5, Status b200001806000e0f
- The system halts with a purple screen.
Resolution
What is a Machine-Check Exception (MCE)?
The machine-check architecture is a mechanism within the CPU that detects and reports hardware issues. When a problem is detected, a machine-check exception (#MC) is thrown. If a purple screen fault reports a machine-check exception, a hardware problem has caused it. There is no other way to throw a machine-check exception.
Understanding the machine-check architecture
The machine-check architecture consists of global registers and a set of error-reporting registers for each bank.
Global registers:
- MCG_CAP – a read-only register that provides information about the machine-check architecture implementation
- MCG_CTL – controls the reporting of machine-check exceptions
- MCG_STATUS – reports information when a machine-check exception occurs
Registers in each bank (i is the bank number):
- MCi_CTL – controls the reporting for the bank
- MCi_STATUS – contains the information about the machine-check exception
- MCi_ADDR – memory address of the exception (if deemed valid)
- MCi_MISC – additional description of the machine-check exception (if deemed valid)
On AMD processors, the bank number identifies the unit that reported the error:
- Bank 0 – Data Cache
- Bank 1 – Instruction Cache
- Bank 2 – Bus Unit
- Bank 3 – Load Store Unit
- Bank 4 – Northbridge and DRAM
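As a rough illustration, the bank list above can be turned into a lookup. This is only a sketch: it assumes the list applies to AMD processors (as the worked example later in this article indicates) and that the vendor_id string for AMD is AuthenticAMD. The function name describe_bank is hypothetical.

```python
# Bank numbers and units, copied from the list above.
# Assumption: this mapping is only meaningful on AMD processors.
AMD_BANKS = {
    0: "Data Cache",
    1: "Instruction Cache",
    2: "Bus Unit",
    3: "Load Store Unit",
    4: "Northbridge and DRAM",
}


def describe_bank(vendor_id, bank):
    """Return the unit name for a bank number, if the vendor allows it."""
    if vendor_id != "AuthenticAMD":
        return "Bank {} (unit cannot be determined for {})".format(bank, vendor_id)
    return AMD_BANKS.get(bank, "Bank {} (not in the list above)".format(bank))


print(describe_bank("AuthenticAMD", 4))
print(describe_bank("GenuineIntel", 0))
```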
Decoding the machine-check exception
0:00:28:43.588 cpu0:1077)ALERT: MCE: 169: Machine Check Exception: General Status 0000000000000004
0:00:28:43.588 cpu0:1077)ALERT: MCE: 193: Machine Check Exception: Bank 0, Status be0000001008081f
Decoding the Global Status register (General Status) - MCG_STATUS
- Bits 63 -> 3 – Reserved
- Bit 2 – MCIP – Machine check in progress
- Bit 1 – EIPV – Error IP valid flag
- Bit 0 – RIPV – Restart IP valid flag
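The three flag bits can be checked with a few lines of Python. This is a minimal sketch, not a VMware tool; the function name decode_mcg_status is hypothetical, and the input is the hexadecimal General Status string from the log entry.

```python
# Flag bits of MCG_STATUS as listed above; all higher bits are reserved.
MCG_STATUS_FLAGS = {
    2: "MCIP",  # machine check in progress
    1: "EIPV",  # error IP valid
    0: "RIPV",  # restart IP valid
}


def decode_mcg_status(hex_value):
    """Return the names of the flags set in a General Status value."""
    value = int(hex_value, 16)
    return [name for bit, name in MCG_STATUS_FLAGS.items() if (value >> bit) & 1]


# General Status from the sample log entry above.
print(decode_mcg_status("0000000000000004"))  # ['MCIP']
```

For the sample log entry, only MCIP is set: a machine check is in progress, and neither instruction pointer flag is marked valid.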
Decoding the Bank Status registers (MCA only)
- Bit 63 – Valid – MCi_STATUS register contents are valid
- Bit 62 – Overflow
- Bit 61 – Error uncorrected
- Bit 60 – Error enabled
- Bit 59 – MCi_MISC register is valid
- Bit 58 – MCi_ADDR register is valid
- Bit 57 – PCC – Processor context is corrupt
- Bits 56 -> 32 – Other information
- Bits 31 -> 16 – Model-specific error code
- Bits 15 -> 0 – Machine-check architecture (MCA) error code
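The field layout above can be sketched as a small Python function. The function and key names are hypothetical. Bit 63 (valid) is included because the worked example later in this article decodes it.

```python
def decode_bank_status(hex_value):
    """Split a 64-bit MCi_STATUS value into the fields listed above."""
    status = int(hex_value, 16)

    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "pcc": bit(57),
        "other_info": (status >> 32) & 0x1FFFFFF,       # bits 56 -> 32 (25 bits)
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31 -> 16
        "mca_error_code": status & 0xFFFF,               # bits 15 -> 0
    }


# Bank 0 status from the sample log entry above.
fields = decode_bank_status("be0000001008081f")
print(hex(fields["mca_error_code"]))  # 0x81f
```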
Simple Error Code Encoding
| Error Code | Binary Encoding | Meaning |
| No Error | 0000 0000 0000 0000 | No error has been reported to this bank of error-reporting registers. |
| Unclassified | 0000 0000 0000 0001 | This error has not been classified into the MCA error classes. |
| Microcode ROM Parity | 0000 0000 0000 0010 | Parity error in internal microcode ROM. |
| External Error | 0000 0000 0000 0011 | The BINIT# from another processor caused this processor to enter machine check. |
| FRC Error | 0000 0000 0000 0100 | FRC (functional redundancy check) master/slave error. |
Compound Error Code Encoding
| Type | Form | Interpretation |
| TLB Errors | 0000 0000 0001 TTLL | {TT}TLB{LL}_ERR |
| Memory Hierarchy Errors | 0000 0001 RRRR TTLL | {TT}CACHE{LL}_{RRRR}_ERR |
| Bus and Interconnect Errors | 0000 1PPT RRRR IILL | BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR |
| Internal Timer | 0000 0100 0000 0000 | Internal timer error |
Note: For the status value b200001806000e0f shown under Symptoms, the 16 lowest bits are 0e0f = 0000 1110 0000 1111. This matches the Bus and Interconnect Errors form, so you need to decode the PP, T, RRRR, II, and LL values.
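Matching a 16-bit MCA error code against the simple and compound forms comes down to comparing its fixed bits. The following is a sketch under the assumption that only the forms listed in the two tables above are of interest; the function name classify_mca_code is hypothetical.

```python
# Simple error codes from the first table.
SIMPLE_CODES = {
    0x0000: "No Error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM Parity",
    0x0003: "External Error",
    0x0004: "FRC Error",
}


def classify_mca_code(code):
    """Return the error type whose fixed bits match the 16-bit code."""
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    if code == 0x0400:                 # 0000 0100 0000 0000
        return "Internal Timer"
    if (code & 0xF800) == 0x0800:      # 0000 1PPT RRRR IILL
        return "Bus and Interconnect Errors"
    if (code & 0xFF00) == 0x0100:      # 0000 0001 RRRR TTLL
        return "Memory Hierarchy Errors"
    if (code & 0xFFF0) == 0x0010:      # 0000 0000 0001 TTLL
        return "TLB Errors"
    return "Not matched by the forms above"


print(classify_mca_code(0x0E0F))  # Bus and Interconnect Errors
```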
Encoding for TT (Transaction Type) Sub-Field
| Transaction Type | Mnemonic | Binary Encoding |
| Instruction | I | 00 |
| Data | D | 01 |
| Generic | G | 10 |
Level Encoding for LL (Memory Hierarchy Level) Sub-Field
| Hierarchy Level | Mnemonic | Binary Encoding |
| Level 0 | L0 | 00 |
| Level 1 | L1 | 01 |
| Level 2 | L2 | 10 |
| Generic | LG | 11 |
Encoding of Request (RRRR) Sub-Field
| Request Type | Mnemonic | Binary Encoding |
| Generic Error | ERR | 0000 |
| Generic Read | RD | 0001 |
| Generic Write | WR | 0010 |
| Data Read | DRD | 0011 |
| Data Write | DWR | 0100 |
| Instruction Fetch | IRD | 0101 |
| Prefetch | PREFETCH | 0110 |
| Eviction | EVICT | 0111 |
| Snoop | SNOOP | 1000 |
Encodings of Participation (PP) Sub-Field
| Transaction | Mnemonic | Binary Encoding |
| Local processor originated request | SRC | 00 |
| Local processor responded to request | RES | 01 |
| Local processor observed error as 3rd party | OBS | 10 |
| Generic | | 11 |
Encodings of Time-out (T) Sub-Field
| Transaction | Mnemonic | Binary Encoding |
| Request timed out | TIMEOUT | 1 |
| Request did not time out | NOTIMEOUT | 0 |
Encodings of Memory or I/O (II) Sub-Field
| Transaction | Mnemonic | Binary Encoding |
| Memory Access | M | 00 |
| Reserved | | 01 |
| I/O | IO | 10 |
| Other transaction | | 11 |
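The sub-field tables can be combined to build the BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR interpretation for a bus and interconnect error code. This is an illustrative sketch only. The function name is hypothetical, and GENERIC, RESERVED, and OTHER are placeholder labels for encodings that have no mnemonic in the tables.

```python
# Mnemonics from the sub-field tables above.
# GENERIC, RESERVED and OTHER are placeholders: the tables give no mnemonic.
PP = {0b00: "SRC", 0b01: "RES", 0b10: "OBS", 0b11: "GENERIC"}
T = {0: "NOTIMEOUT", 1: "TIMEOUT"}
RRRR = {
    0b0000: "ERR", 0b0001: "RD", 0b0010: "WR", 0b0011: "DRD", 0b0100: "DWR",
    0b0101: "IRD", 0b0110: "PREFETCH", 0b0111: "EVICT", 0b1000: "SNOOP",
}
II = {0b00: "M", 0b01: "RESERVED", 0b10: "IO", 0b11: "OTHER"}
LL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}


def decode_bus_error(code):
    """Decode a 16-bit code of the form 0000 1PPT RRRR IILL."""
    if (code & 0xF800) != 0x0800:
        raise ValueError("not a bus and interconnect error code")
    pp = (code >> 9) & 0b11       # bits 10 -> 9
    t = (code >> 8) & 0b1         # bit 8
    rrrr = (code >> 4) & 0b1111   # bits 7 -> 4
    ii = (code >> 2) & 0b11       # bits 3 -> 2
    ll = code & 0b11              # bits 1 -> 0
    return "BUS{}_{}_{}_{}_{}_ERR".format(
        LL[ll], PP[pp], RRRR.get(rrrr, "UNKNOWN"), II[ii], T[t]
    )


print(decode_bus_error(0x081F))  # BUSLG_SRC_RD_OTHER_NOTIMEOUT_ERR
```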
Example of decoding a bank status register
- Check to see who the CPU manufacturer is.
  Run the following command:
  # cat /proc/cpuinfo
  The output shows a line with vendor_id.
  vendor_id : GenuineIntel
  In this case the CPU manufacturer is Intel, so you cannot determine the significance of the bank number. If it were AMD, the bank number would provide more details.
- From the log entry, the value of MC0_STATUS is be0000001008081f. This value is in hexadecimal. Convert this 64-bit number to binary to decode it further.
  be0000001008081f = 1011 1110 0000 0000 0000 0000 0000 0000 0001 0000 0000 1000 0000 1000 0001 1111
- Decode the 7 most significant bits first to see the general information they contain, that is, decode bits 63 -> 57, which are 1011 111.
  Bit 63 = 1 – MCi_STATUS register contents are valid (if this bit is 0, then ignore all of the data in this register)
  Bit 62 = 0 – An overflow did not occur
  Bit 61 = 1 – Error was not corrected
  Bit 60 = 1 – Error checking was enabled
  Bit 59 = 1 – Contents of the MCi_MISC register are valid
  Bit 58 = 1 – Contents of the MCi_ADDR register are valid
  Bit 57 = 1 – Processor context is corrupt - register values are unreliable
  Note: Bits are counted from 0 to 63, 0 being the least significant bit and 63 being the most significant bit.
- Decode the MCA error code. The MCA error code is the 16 least significant bits of the MCi_STATUS register, that is, decode bits 15 -> 0, which are 0000 1000 0001 1111.
  The pattern of these bits follows the compound error code form for bus and interconnect errors. The pattern is 0000 1PPT RRRR IILL.
  Decoding the pattern you get:
  0000 1000 0001 1111
  0000 1PPT RRRR IILL
  PP = 00 – Local processor originated the request that failed
  T = 0 – Operation that was in progress did not time out
  RRRR = 0001 – Operation was a generic read operation
  II = 11 – Operation was marked as Other transaction, that is, not a regular memory or I/O operation
  LL = 11 – Generic level, or not a cache-based operation
- Summarize the findings:
  A fault occurred during a generic read operation that the local processor originated. The request did not time out and was marked as Other transaction, that is, it was neither a regular memory access nor an I/O operation.
- Work on correcting any of the hardware issues. If required, contact the hardware vendor for further assistance.
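The first part of this walkthrough, pulling the bank number and status out of a log entry and converting the status to binary, can be sketched in Python. The regular expression is an assumption based on the two sample log lines in this article; other releases may format the entry differently. The function name parse_mce_line is hypothetical.

```python
import re

# Assumed log format, taken from the sample entries in this article.
LOG_PATTERN = re.compile(
    r"Machine Check Exception: Bank (\d+), Status ([0-9a-fA-F]+)"
)


def parse_mce_line(line):
    """Return (bank, status, binary string in groups of 4) or None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    bank = int(match.group(1))
    status = int(match.group(2), 16)
    bits = format(status, "064b")  # zero-padded to 64 binary digits
    grouped = " ".join(bits[i:i + 4] for i in range(0, 64, 4))
    return bank, status, grouped


line = ("0:00:28:43.588 cpu0:1077)ALERT: MCE: 193: "
        "Machine Check Exception: Bank 0, Status be0000001008081f")
bank, status, grouped = parse_mce_line(line)
print(bank)     # 0
print(grouped)  # starts with 1011 1110 and ends with 0001 1111
```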
Other considerations
- The information reported by the machine-check architecture is intended to aid in troubleshooting the hardware issue. Occasionally the information decoded from the MCA error code is not enough. If more information is required, refer to the documentation from the CPU manufacturer.
- If the information is invalid but a machine-check exception has occurred, the fault may have occurred in a way that prevented enough information from being recorded. In this case a hardware issue is still at fault.
- Depending on the hardware fault, the vmware-log file for the failure may contain machine-check exception information for more than one CPU. Decoding the information reported by all CPUs may prove useful.
- Providing the log entries containing the machine-check information to the hardware vendor may help with their investigation of the hardware problem.
Additional Information
- KB Article: 1005184
- Updated: Nov 13, 2009
- Products:
VMware ESX
VMware ESXi
- Product Versions:
VMware ESX 1.0.x
VMware ESX 1.5.x
VMware ESX 2.0.x
VMware ESX 2.1.x
VMware ESX 2.5.x
VMware ESX 3.0.x
VMware ESX 3.5.x
VMware ESX 4.0.x
VMware ESXi 3.5.x Embedded
VMware ESXi 3.5.x Installable
VMware ESXi 4.0.x Embedded
VMware ESXi 4.0.x Installable

