VMware
 

Knowledge Base


Decoding Machine Check Exception (MCE) output after a purple screen error

Symptoms

  • A purple screen fault is encountered. The type of fault is Machine Check Exception (MCE).
  • The purple screen shows the following message:

    Machine Check Exception: Unable to continue

  • When extracting the logs from the core dump you see messages similar to:

    ALERT: MCE: 171: Machine Check Exception: Bank 5, Status b200001806000e0f

  • The system halts with a purple screen.

Resolution

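This article contains the following sections:

  • What is a Machine-Check Exception (MCE)?
  • Understanding the machine-check architecture
  • Decoding the machine-check exception
  • Decoding the Global Status register (General Status) - MCG_STATUS
  • Decoding the Bank Status registers (MCA only)
  • Example of decoding a bank status register
  • Other considerations
  • Additional Information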

What is a Machine-Check Exception (MCE)?

The machine-check architecture is a mechanism within the CPU that detects and reports hardware issues. When a problem is detected, a machine-check exception (#MC) is thrown. If a purple screen fault occurs because of a machine-check exception, a hardware problem caused it; nothing else can throw a machine-check exception.

When the system has faulted with a purple screen, capture the screen output, then reboot the server and contact your hardware vendor. In the meantime, the information about the fault can be decoded to get a better idea of what may be happening.

Understanding the machine-check architecture

Recent CPUs from Intel and AMD implement a machine-check architecture that detects and reports hardware errors. The errors detected and reported include system bus errors, RAM (ECC and parity) errors, and other CPU errors (cache, TLB, and so on). A set of model-specific registers (MSRs) is used to configure error detection and to report errors.
 
There are three (3) global (or general) registers and four (4) or five (5) sets of bank registers. Each bank has four (4) registers. Each error-reporting bank is associated with a specific hardware unit or group of hardware units.
 
The global registers are:
  • MCG_CAP – a read-only register that provides information about the machine-check architecture implementation
  • MCG_CTL – controls the reporting of machine-check exceptions
  • MCG_STATUS – reports information when a machine-check exception occurs
Each set of bank registers includes the following registers:
  • MCi_CTL – controls the reporting for the bank
  • MCi_STATUS – contains the information about the machine-check exception
  • MCi_ADDR – memory address of the exception (if deemed valid)
  • MCi_MISC – additional description of the machine-check exception (if deemed valid)
Note: The i in these register names stands for the bank number. For example, for bank 3 the bank status register is known as MC3_STATUS.
 
If the fault occurred on an AMD processor, the banks have the following meanings:
  • Bank 0 – Data Cache
  • Bank 1 – Instruction Cache
  • Bank 2 – Bus Unit
  • Bank 3 – Load Store Unit
  • Bank 4 – Northbridge and DRAM
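The bank list above can be kept as a small lookup table. This is only a sketch: the function name describe_bank is made up for illustration, and it assumes the vendor string is the one AMD processors report in /proc/cpuinfo (AuthenticAMD). For any other vendor it returns nothing, because this article lists bank meanings for AMD only.

```python
# Bank meanings for AMD processors, copied from the list above.
AMD_BANKS = {
    0: "Data Cache",
    1: "Instruction Cache",
    2: "Bus Unit",
    3: "Load Store Unit",
    4: "Northbridge and DRAM",
}


def describe_bank(vendor_id, bank):
    """Return the hardware unit for an AMD bank, or None when unknown.

    vendor_id is the vendor_id string from /proc/cpuinfo. Only
    "AuthenticAMD" is mapped, because the list above is AMD specific.
    """
    if vendor_id != "AuthenticAMD":
        return None
    return AMD_BANKS.get(bank)


print(describe_bank("AuthenticAMD", 4))  # Northbridge and DRAM
print(describe_bank("GenuineIntel", 0))  # None
```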

Decoding the machine-check exception

Extract the logs from the vmkernel-zdump file generated during the purple screen fault. For more information, see Extracting the log file after an ESX or ESXi host fails with a purple screen (1006796).
 
This sample log output indicates that a machine-check exception has occurred. It also shows the details reported by the system.
 
0:00:28:43.588 cpu0:1077)ALERT: MCE: 578: Machine Check Exception
0:00:28:43.588 cpu0:1077)ALERT: MCE: 169: Machine Check Exception: General Status 0000000000000004
0:00:28:43.588 cpu0:1077)ALERT: MCE: 193: Machine Check Exception: Bank 0, Status be0000001008081f
 
From this example, you see that a machine-check exception has occurred on CPU 0 and that information is populated in the General Status register and the Bank 0 Status register. The General Status register is also known as the MCG_STATUS register, and the Bank 0 Status register is known as MC0_STATUS. The mapping between the log messages and the registers of the machine-check architecture is summarized here.
 
For global registers, the logs report only the status register. It is shown as the General Status register and refers to the Global Status register (MCG_STATUS). For bank registers, the logs show the bank number and then the register. Thus, the message "Bank 0, Status" refers to the Status register for Bank 0 (MC0_STATUS). This article writes bank registers as MCi_STATUS; replace the i in the name with the number of the bank being reported.
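The mapping from log messages to register names can be sketched in a few lines of Python. The regular expressions below are assumptions based only on the sample log lines in this article; other ESX versions may format the messages differently. The function name parse_mce_line is hypothetical.

```python
import re

# Patterns inferred from the sample log lines in this article (an assumption).
GENERAL_RE = re.compile(r"Machine Check Exception: General Status ([0-9a-fA-F]+)")
BANK_RE = re.compile(r"Machine Check Exception: Bank (\d+), Status ([0-9a-fA-F]+)")
CPU_RE = re.compile(r"cpu(\d+):")


def parse_mce_line(line):
    """Return (cpu, register_name, value) for an MCE status line, else None."""
    cpu_match = CPU_RE.search(line)
    cpu = int(cpu_match.group(1)) if cpu_match else None
    match = GENERAL_RE.search(line)
    if match:
        return cpu, "MCG_STATUS", int(match.group(1), 16)
    match = BANK_RE.search(line)
    if match:
        name = "MC%d_STATUS" % int(match.group(1))
        return cpu, name, int(match.group(2), 16)
    return None


line = ("0:00:28:43.588 cpu0:1077)ALERT: MCE: 193: "
        "Machine Check Exception: Bank 0, Status be0000001008081f")
cpu, register, value = parse_mce_line(line)
print(cpu, register, hex(value))  # 0 MC0_STATUS 0xbe0000001008081f
```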

Decoding the Global Status register (General Status) - MCG_STATUS

The Global Status register contains some simple information to indicate whether a machine-check exception has occurred.
 
The bits represented are as follows:
  • Bits 63 -> 3 – Reserved
  • Bit 2 – MCIP – Machine check in progress
  • Bit 1 – EIPV – Error IP valid flag
  • Bit 0 – RIPV – Restart IP valid flag
The most important bit here is bit 2, the Machine Check in progress bit.
 
From this example, you see:
0:00:28:43.588 cpu0:1077)ALERT: MCE: 169: Machine Check Exception: General Status 0000000000000004
 
The value of MCG_STATUS is 0000000000000004. Converting the lowest hexadecimal digit to binary gives 0100. Hence bit 2 is set, which tells you that a machine check is in progress.
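The same check can be done programmatically. The following is a minimal sketch that tests the three defined bits of MCG_STATUS; the function name decode_mcg_status is illustrative only.

```python
# Bit positions of the three defined MCG_STATUS flags, as listed above.
MCG_STATUS_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def decode_mcg_status(value):
    """Map each defined flag name to True (bit set) or False (bit clear)."""
    return {name: bool((value >> bit) & 1)
            for bit, name in MCG_STATUS_BITS.items()}


flags = decode_mcg_status(int("0000000000000004", 16))
print(flags)  # {'RIPV': False, 'EIPV': False, 'MCIP': True}
```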

Decoding the Bank Status registers (MCA only)

All Bank Status registers have the same format. This register contains more detail about the machine-check exception.
 
The bits represented are as follows:
 
Bit 63 – MCi_STATUS register is valid
Bit 62 – Overflow
Bit 61 – Error uncorrected
Bit 60 – Error enabled
Bit 59 – MCi_MISC register is valid
Bit 58 – MCi_ADDR register is valid
Bit 57 – PCC – Processor context is corrupt
Bits 56 -> 32 – Other information
Bits 31 -> 16 – Model-specific error code
Bits 15 -> 0  – Machine-check architecture (MCA) error code
 
This article covers only the MCA error code, which is the 16 least significant bits of the MCi_STATUS register. See the Additional Information section for sources that explain the rest of the bits in this type of register. The following details how to decode the MCA error code.
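As an illustration of the layout listed above, the sketch below splits a 64-bit bank status value into its flag bits and its three multi-bit fields. The dictionary keys (valid, overflow, and so on) are labels chosen for this sketch and are not official register field names.

```python
# (bit position, label) pairs for the single-bit flags listed above.
# The labels are chosen for this sketch only.
STATUS_FLAGS = [
    (63, "valid"),
    (62, "overflow"),
    (61, "uncorrected"),
    (60, "enabled"),
    (59, "misc_valid"),
    (58, "addr_valid"),
    (57, "pcc"),
]


def split_bank_status(value):
    """Split an MCi_STATUS value into flags and multi-bit fields."""
    flags = {name: bool((value >> bit) & 1) for bit, name in STATUS_FLAGS}
    return {
        "flags": flags,
        "other_info": (value >> 32) & 0x1FFFFFF,        # bits 56 -> 32
        "model_specific_code": (value >> 16) & 0xFFFF,  # bits 31 -> 16
        "mca_error_code": value & 0xFFFF,               # bits 15 -> 0
    }


fields = split_bank_status(0xbe0000001008081f)
print(fields["flags"])
print(hex(fields["model_specific_code"]))  # 0x1008
print(hex(fields["mca_error_code"]))       # 0x81f
```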
 
When looking at the MCA error code, first check whether it is one of the simple error codes or one of the compound (complex) error codes. The following tables list both kinds. For a compound code, use its form in the table as a template, then use the sub-field tables that follow to decode each field of the error code.

Simple Error Code Encoding

Error Code             Binary Encoding       Meaning
No Error               0000 0000 0000 0000   No error has been reported to this bank of error-reporting registers.
Unclassified           0000 0000 0000 0001   This error has not been classified into the MCA error classes.
Microcode ROM Parity   0000 0000 0000 0010   Parity error in internal microcode ROM.
External Error         0000 0000 0000 0011   The BINIT# from another processor caused this processor to enter machine check.
FRC Error              0000 0000 0000 0100   FRC (functional redundancy check) master/slave error.

Compound Error Code Encoding

Type                          Form                  Interpretation
TLB Errors                    0000 0000 0001 TTLL   {TT}TLB{LL}_ERR
Memory Hierarchy Errors       0000 0001 RRRR TTLL   {TT}CACHE{LL}_{RRRR}_ERR
Bus and Interconnect Errors   0000 1PPT RRRR IILL   BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR
Internal Timer                0000 0100 0000 0000

Note: For example, the MCA error code 0000 1110 0000 1111 (0e0f, the 16 least significant bits of the status value b200001806000e0f shown under Symptoms) matches the Bus and Interconnect Errors form. To interpret it, decode the PP, T, RRRR, II, and LL values.
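Matching a code against the forms in the two tables above amounts to comparing its fixed leading bits. The following is a minimal sketch of that matching; the function name classify_mca_code and the "Unknown" fallback are illustrative choices, not part of the machine-check architecture.

```python
# Simple error codes from the Simple Error Code Encoding table.
SIMPLE_CODES = {
    0x0000: "No Error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM Parity",
    0x0003: "External Error",
    0x0004: "FRC Error",
}


def classify_mca_code(code):
    """Return the table row that a 16-bit MCA error code matches."""
    code &= 0xFFFF
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    if code == 0x0400:                # 0000 0100 0000 0000
        return "Internal Timer"
    if (code & 0xF800) == 0x0800:     # 0000 1PPT RRRR IILL
        return "Bus and Interconnect Errors"
    if (code & 0xFF00) == 0x0100:     # 0000 0001 RRRR TTLL
        return "Memory Hierarchy Errors"
    if (code & 0xFFF0) == 0x0010:     # 0000 0000 0001 TTLL
        return "TLB Errors"
    return "Unknown"                  # fallback label for this sketch


print(classify_mca_code(0x0e0f))  # Bus and Interconnect Errors
```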

Encoding for TT (Transaction Type) Sub-Field

Transaction Type   Mnemonic   Binary Encoding
Instruction        I          00
Data               D          01
Generic            G          10


Level Encoding for LL (Memory Hierarchy Level) Sub-Field

Hierarchy Level   Mnemonic   Binary Encoding
Level 0           L0         00
Level 1           L1         01
Level 2           L2         10
Generic           LG         11

Encoding of Request (RRRR) Sub-Field

Request Type        Mnemonic   Binary Encoding
Generic Error       ERR        0000
Generic Read        RD         0001
Generic Write       WR         0010
Data Read           DRD        0011
Data Write          DWR        0100
Instruction Fetch   IRD        0101
Prefetch            PREFETCH   0110
Eviction            EVICT      0111
Snoop               SNOOP      1000

Encodings of Participation (PP) Sub-Field

Transaction                                   Mnemonic   Binary Encoding
Local processor originated request            SRC        00
Local processor responded to request          RES        01
Local processor observed error as 3rd party   OBS        10
Generic                                                  11

Encodings of Time-out (T) Sub-Field

Transaction                Mnemonic    Binary Encoding
Request timed out          TIMEOUT     1
Request did not time out   NOTIMEOUT   0

 
Encodings of Memory or I/O (II) Sub-Field

Transaction         Mnemonic   Binary Encoding
Memory Access       M          00
Reserved                       01
I/O                 IO         10
Other transaction              11
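The sub-field tables above can be combined with the compound forms to build the mnemonic for a code. The sketch below does this under a few assumptions: the tables give no mnemonic for PP = 11, II = 01, or II = 11, so the labels GENERIC, RESERVED, and OTHER are placeholders invented here, and any encoding the tables do not list is shown as "?". The function name decode_compound is hypothetical.

```python
TT = {0b00: "I", 0b01: "D", 0b10: "G"}
LL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
# GENERIC, RESERVED and OTHER are placeholder labels for this sketch;
# the tables above leave those mnemonics blank.
PP = {0b00: "SRC", 0b01: "RES", 0b10: "OBS", 0b11: "GENERIC"}
II = {0b00: "M", 0b01: "RESERVED", 0b10: "IO", 0b11: "OTHER"}
T = {0: "NOTIMEOUT", 1: "TIMEOUT"}


def decode_compound(code):
    """Return the mnemonic for a compound MCA error code, else None."""
    code &= 0xFFFF
    ll = LL[code & 0b11]                    # bits 1 -> 0
    tt = TT.get((code >> 2) & 0b11, "?")    # bits 3 -> 2 (TLB and cache forms)
    rrrr = RRRR.get((code >> 4) & 0xF, "?")  # bits 7 -> 4
    if (code & 0xF800) == 0x0800:           # 0000 1PPT RRRR IILL
        pp = PP[(code >> 9) & 0b11]         # bits 10 -> 9
        t = T[(code >> 8) & 0b1]            # bit 8
        ii = II[(code >> 2) & 0b11]         # bits 3 -> 2
        return "BUS%s_%s_%s_%s_%s_ERR" % (ll, pp, rrrr, ii, t)
    if (code & 0xFF00) == 0x0100:           # 0000 0001 RRRR TTLL
        return "%sCACHE%s_%s_ERR" % (tt, ll, rrrr)
    if (code & 0xFFF0) == 0x0010:           # 0000 0000 0001 TTLL
        return "%sTLB%s_ERR" % (tt, ll)
    return None


print(decode_compound(0x0135))  # DCACHEL1_DRD_ERR
```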

 

Example of decoding a bank status register 

From the example above, the information from this MC0_STATUS register of CPU 0 is decoded.
 
0:00:28:43.588 cpu0:1077)ALERT: MCE: 193: Machine Check Exception: Bank 0, Status be0000001008081f   
  1. Check to see who the CPU manufacturer is. 

    Run the following command:

    # cat /proc/cpuinfo

    The output shows a line with vendor_id.

    vendor_id       : GenuineIntel

    In this case the CPU manufacturer is Intel, so the bank meanings listed earlier, which apply to AMD processors, cannot be used to determine the significance of the bank number. If the manufacturer were AMD, the bank number would provide more detail.

  2. The log entry shows that the value of MC0_STATUS is be0000001008081f. This value is in hexadecimal. Convert this 64-bit number to binary to decode it further.

    be0000001008081f = 1011 1110 0000 0000 0000 0000 0000 0000 0001 0000 0000 1000 0000 1000 0001 1111

  3. Decode the 7 most significant bits first to see the general information contained within, that is, decode bits 57 - 63, which are 1011 111.

    Bit 63 = 1 – MCi_STATUS register contents are valid (if this bit is 0, then ignore all of the data in this register)
    Bit 62 = 0 – An overflow did not occur
    Bit 61 = 1 – Error was not corrected
    Bit 60 = 1 – Error checking was enabled
    Bit 59 = 1 – Contents of the MCi_MISC register are valid
    Bit 58 = 1 – Contents of the MCi_ADDR register are valid
    Bit 57 = 1 – Processor context is corrupt - register values are unreliable

    Note: Bits are counted from 0 to 63, 0 being the least significant bit and 63 being the most significant bit.

  4. Decode the MCA error code. The MCA error code is the 16 least significant bits of the MCi_STATUS register, that is, decode bits 0 - 15, which are 0000 1000 0001 1111.

    The pattern of these bits follows the compound error code for bus and interconnect errors. The pattern is 0000 1PPT RRRR IILL.

    Decoding the pattern you get:

    0000 1000 0001 1111
    0000 1PPT RRRR IILL

    PP   =   00  Local processor was performing an operation that failed
    T    =    0  Operation that was in progress did not time out
    RRRR = 0001  Operation was a generic read operation
    II   =   11  Operation was marked as Other Transaction, that is, not a regular memory or I/O operation
    LL   =   11  Generic level, or not a cache-based operation


  5. Summarize the findings:

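    It appears that a fault occurred during a generic read operation originated by the local processor. The operation did not time out and was marked as Other Transaction, that is, it was not a regular memory or I/O operation.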

  6. Work on correcting any of the hardware issues. If required, contact the hardware vendor for further assistance.
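The hexadecimal-to-binary conversion in step 2 is easy to get wrong by hand. The following sketch prints a status value as 16 groups of 4 bits, most significant bit first, so that the 7 leftmost bits (step 3) and the 4 rightmost groups (step 4) can be read off directly. The function name hex_to_binary_groups is illustrative only.

```python
def hex_to_binary_groups(hex_string):
    """Format a 64-bit hexadecimal value as 16 space-separated 4-bit groups."""
    value = int(hex_string, 16)
    bits = format(value, "064b")  # zero-padded to 64 binary digits
    return " ".join(bits[i:i + 4] for i in range(0, 64, 4))


groups = hex_to_binary_groups("be0000001008081f")
print(groups)
# 1011 1110 0000 0000 0000 0000 0000 0000 0001 0000 0000 1000 0000 1000 0001 1111

# Bits 63 -> 57 are the first 7 binary digits; bits 15 -> 0 are the last 4 groups.
print(groups.replace(" ", "")[:7])  # 1011111
print(groups[-19:])                 # 0000 1000 0001 1111
```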

Other considerations

  • The information reported by the machine-check architecture is to aid in troubleshooting the hardware issue. Occasionally the information decoded from the MCA error code is not enough. If more information is required, refer to the documentation from the CPU manufacturers for more details. 

  • If the information is invalid but a machine-check exception has occurred, it may be that a fault occurred where enough information regarding the fault could not be recorded. In this case a hardware issue is still at fault.

  • Depending on the hardware fault, the vmware-log file for the failure may contain machine-check exception information for more than one CPU. Decoding the information reported by all CPUs may prove useful.

  • Providing the log entries containing the machine-check information to the hardware vendor may help with their investigation of the problems with the hardware.

Additional Information

More detailed information can be found in the documentation from the CPU manufacturers.

For AMD-based processors, a utility is available that automates the interpretation of the machine-check exception information that is received. This utility is called Machine Check Analysis Tool (MCAT).
 
A VMware Communities article provides other terms that are related to an ESX Purple Screen.
