Decoding Machine Check Exception (MCE) output after a purple screen error (1005184)

Symptoms

  • An ESX/ESXi host halts with a purple diagnostic screen.

  • The purple diagnostic screen shows a message similar to:

    • Machine Check Exception: Unable to continue
    • Hardware (Machine) Error
    • PCPU: 1 hardware errors seen since boot (1 corrected by hardware)

  • When extracting the logs from the core dump you see messages similar to:

    • ALERT: MCE: 171: Machine Check Exception: Bank x, Status nnnnnnnnnnn
    • MC:PCPUn B:x S:nnnnnnnnnnn M:mmmmmmmmmmmm: A:aaaaaaaaaaa

  • On AMD systems you may see a message which indicates a hardware issue, but an MCE does not occur. The message is similar to:

    vmkernel: 72:03:47:16.847 cpu4:14403)MCE: 978: MCE not recoverable but did not generate an exception.

Purpose

The machine check architecture is a mechanism within a CPU to detect and report hardware issues. When a problem is detected, a machine check exception (MCE) is thrown. If an MCE is thrown and a purple diagnostic screen is displayed, a hardware problem has caused it. There is no other way to generate an MCE.

When the host has faulted with an MCE purple diagnostic screen, capture a screenshot of the screen output, reboot the server, collect the logs, and contact your hardware vendor. For more information, see Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen (1004128). In the meantime, the information about the fault itself can be decoded to get a better idea of what may be happening.
 
Note: If you experience a purple diagnostic screen which does not mention MC, Machine Check Exception, or Hardware (Machine) Error, see Interpreting an ESX host purple diagnostic screen (1004250).

Resolution

Recent CPUs from Intel and AMD implement a machine-check architecture that detects and reports hardware issues, including system bus errors, RAM (ECC and parity) errors, and other CPU errors. A set of model-specific registers (MSRs) is used to report errors.

When a hardware error occurs, global and bank-specific status machine-check architecture registers are populated with information regarding the cause, and whether the CPU can safely continue execution. In the case of a correctable error, ESX/ESXi reports the incident and register contents in the VMkernel logs. If an error is uncorrectable, and the CPU cannot continue safely, ESX/ESXi halts with a purple diagnostic screen.

During an MCE, the contents of the machine-check architecture registers are logged. The messages appear on the purple diagnostic screen itself and are recorded in the log file within the VMkernel zdump file. For more information, see Extracting the log file after an ESX or ESXi host fails with a purple screen error (1006796). If serial-line logging is configured, the same messages are emitted on the serial port. For more information, see Enabling serial-line logging for an ESX and ESXi host (1003900).

Machine-Check Architecture registers

The global MCA register (MCG_STATUS) reports whether an MCE is in progress, and if the instruction pointer pushed on to the stack can be used to reliably restart program execution or is directly associated with the error.

The global capabilities (MCG_CAP) register identifies the capabilities of the machine-check architecture of the processor. The lower 8 bits specify the number of hardware-unit error-reporting banks present in a particular processor. Each bank of error-reporting registers is associated with a specific hardware unit or group of hardware units, though the association is vendor- and model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.
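
As a minimal sketch of the bank-count field described above, the lower 8 bits of MCG_CAP can be masked out as follows. The function name and the example register value are hypothetical and chosen only for illustration; ESX/ESXi does not necessarily log MCG_CAP.

```python
def mcg_cap_bank_count(mcg_cap: int) -> int:
    """Return the number of error-reporting banks (lower 8 bits of MCG_CAP)."""
    return mcg_cap & 0xFF


# Hypothetical MCG_CAP value, used for illustration only.
example_mcg_cap = 0x0000000001000C16
print(mcg_cap_bank_count(example_mcg_cap))  # 22 (0x16)
```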

Each error-reporting bank consists of several registers. Of primary interest during a machine check exception is the bank's status register (MCi_STATUS), which contains detailed information regarding the machine check exception, and the address (MCi_ADDR) and miscellaneous (MCi_MISC) registers, which may provide additional information.

Identifying register contents

Different versions of ESX/ESXi log the machine-check architecture register contents using different formats. For more information, see Determining VMware Software Version and Build Number (392).

  • ESX/ESXi 3.x:

    The log message is verbose, consisting of three lines for each bank of interest. Each line contains the text MCE and starts with the physical CPU (C) that experienced the exception. The general status register is displayed on its own line. The status (s), address (a) and miscellaneous (m) registers from each bank of interest are displayed on their own line.

    cpuC:xxxx)Alert: MCE: Machine Check Exception: General Status 000000000000000n
    cpuC:xxxx)Alert: MCE: Machine Check Exception: Bank b, Status ssssssssssssssss
    cpuC:xxxx)Alert: MCE: Machine Check Exception: Bank b, Addr aaaaaaaaaaaaaaaa
    cpuC:xxxx)Alert: MCE: Machine Check Exception: Bank b, Misc mmmm


  • ESX/ESXi 4.0:

    The log message is terse, consisting of one line for each bank of interest, with the general status register displayed on its own line. Each bank line contains the text MCE, the physical CPU number (C), the bank number (B), and the status (s), miscellaneous (m) and address (a) registers.

    Machine Check Global status on cpuC: 0x000000000000000n
    MCE on cpuC bankB: Status:0xssssssssssssssss Misc:0xmmmm Addr:0xaaaaaaaaaaaaaaaa

  • ESX/ESXi 4.1 and ESXi 5.0:

    The log message is terse, consisting of one line for each bank of interest. Each line starts with the text MC and contains the physical CPU number, the bank number, and the status, miscellaneous and address registers, followed by the general status register's value.

    MC:PCPUc B:b S:0xssssssssssssssss M:0xmmmm A:0xaaaaaaaaaaaaaaaa n

Regardless of the version of ESX/ESXi, these items of information should be available:

  • Physical CPU number
  • Global status register
  • Bank number
  • Bank status register
  • Bank address register
  • Bank miscellaneous register
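
The following is a minimal sketch of how the physical CPU number, bank number, and bank registers might be pulled out of a log line. The regular expressions are modelled on the ESX/ESXi 4.0 and 4.1/5.0 templates shown above and are an assumption; real log lines may differ in detail. The function name parse_mce_line is hypothetical, and the general status value is not parsed.

```python
import re
from typing import Optional

# Patterns modelled on the templates in this article (an assumption, not an
# official grammar). search() is used so that log prefixes are tolerated.
PATTERNS = [
    # ESX/ESXi 4.0 style: "... cpuC bankB: Status:0x... Misc:0x... Addr:0x..."
    re.compile(
        r"cpu(?P<cpu>\d+) bank(?P<bank>\d+): "
        r"Status:0x(?P<status>[0-9a-fA-F]+) "
        r"Misc:0x(?P<misc>[0-9a-fA-F]+) "
        r"Addr:0x(?P<addr>[0-9a-fA-F]+)"
    ),
    # ESX/ESXi 4.1 and ESXi 5.0 style: "MC:PCPUc B:b S:0x... M:0x... A:0x..."
    re.compile(
        r"MC:PCPU(?P<cpu>\d+) B:(?P<bank>\d+) "
        r"S:0x(?P<status>[0-9a-fA-F]+) "
        r"M:0x(?P<misc>[0-9a-fA-F]+) "
        r"A:0x(?P<addr>[0-9a-fA-F]+)"
    ),
]


def parse_mce_line(line: str) -> Optional[dict]:
    """Return the CPU, bank and register values as integers, or None."""
    for pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return {
                "cpu": int(match.group("cpu")),
                "bank": int(match.group("bank")),
                "status": int(match.group("status"), 16),
                "misc": int(match.group("misc"), 16),
                "addr": int(match.group("addr"), 16),
            }
    return None


# Sample line taken from the Additional Information section of this article.
sample = ("MCE: 215: CMCI on cpu1 bank8: Status:0xd000008000310080 "
          "Misc:0x0 Addr:0x0: Valid.Overflow.Err enabled.")
parsed = parse_mce_line(sample)
print(parsed["cpu"], parsed["bank"], hex(parsed["status"]))
# 1 8 0xd000008000310080
```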

Automatic interpretation

VMware ESX/ESXi versions 4.0 and later attempt to interpret the contents of the status register(s) for display on the purple diagnostic screen. For example:

  • MCE on cpuC bankB: Status:0xssssssssssssssss Misc:0xmmmm Addr:0xaaaaaaaaaaaaaaaa: Valid.UC.Err enabled.Misc valid.Addr valid.
  • Hardware (Machine) Error: Cache Hierarchy: Level 2 Instruction Cache Instruction Fetch Error. PCPUc in world xxxx:process

Note: Where the automatic interpretation and vendor interpretation disagree, the vendor's interpretation should be taken as correct. The raw contents of the status registers are also available, so they can be manually reviewed.

Decoding the global MCA status (MCG_STATUS) register

 
The global status register is 64 bits wide, but only the low 3 bits have meaning. The high 61 bits are reserved. The global status register value can be converted to binary and compared against the layout below.

    Bits:   63:3      2     1     0
    Field:  Reserved  MCIP  EIPV  RIPV

  • Bit 2: Machine Check In Progress. Identifies whether a machine check is in progress, and whether further fields should be consulted.
  • Bit 1: Error IP Valid. Identifies whether the instruction pointer pushed on to the stack is directly related to the error.
  • Bit 0: Restart IP Valid. Identifies whether the program execution can be reliably restarted at the instruction pointer pushed on to the stack.

For example, the global status register value "5" is equal to 0101 in binary. This translates to MCIP=1, EIPV=0, RIPV=1, which indicates that there is a machine check in progress, and the Restart IP is valid.
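
The example above can be sketched in a few lines of code. The function name decode_mcg_status is hypothetical; the bit positions are those listed in this section.

```python
def decode_mcg_status(value: int) -> dict:
    """Decode the three meaningful low bits of MCG_STATUS."""
    return {
        "MCIP": bool(value & 0b100),  # bit 2: machine check in progress
        "EIPV": bool(value & 0b010),  # bit 1: error IP valid
        "RIPV": bool(value & 0b001),  # bit 0: restart IP valid
    }


print(decode_mcg_status(0x5))
# {'MCIP': True, 'EIPV': False, 'RIPV': True}
```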

Overview of the bank status (MCi_STATUS) register

Each bank's MCi_STATUS register contains information related to a machine-check error. This information is only meaningful and logged if the Valid flag (bit 63) is set. This register is 64 bits wide.

    Bit(s):  63   62    61  60  59     58     57   56:32              31:16                15:0
    Field:   VAL  OVER  UC  EN  MISCV  ADDRV  PCC  Other Information  Extended Error Code  MCA Error Code
 
The high 7 bits (63:57) provide an overview of the processor state and indicate which of the other registers are meaningful:
  • Bit 63: VAL. Indicates (when set) that this bank's status (MCi_STATUS) register is valid, and that further fields should be consulted.
  • Bit 62: OVER. Indicates (when set) that a machine-check error occurred while the results of a previous error were still in the error-reporting register bank. May indicate that ESX/ESXi has not processed the MCE promptly, or that multiple MCEs occurred very close together.
  • Bit 61: UC. Indicates (when set) that the processor did not, or was not able to, correct the error condition. An ESX/ESXi host always generates a purple diagnostic screen when the processor indicates that the error condition was uncorrectable.
  • Bit 60: EN. Indicates (when set) that the error was enabled by the associated EEj bit of the MCi_CTL register. Will generally be 1.
  • Bit 59: MISCV. Indicates (when set) that the associated miscellaneous register (MCi_MISC) for this bank is valid, and contains additional information regarding the error.
  • Bit 58: ADDRV. Indicates (when set) that the associated address register (MCi_ADDR) for this bank is valid, and contains the memory address where the error occurred. The address may be physical or virtual, depending on the type of error encountered.
  • Bit 57: PCC. Indicates (when set) that the state of the processor may have been corrupted by the error condition, and that it may not be possible to reliably resume software execution.

Bits 56:32 contain other information, which may be reserved, used for counters, or hold other information that is model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.

Bits 31:16 contain a model-specific extended error code. For more information, see the vendor documentation listed in the Additional Information section of this article.

Bits 15:0 contain the machine-check architecture-defined error code for the machine-check error condition detected. These error codes are the same for all processors which implement the machine-check architecture, though individual processor models may define additional nuance. For more information, see the vendor documentation listed in the Additional Information section of this article.
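
The field layout described above can be sketched as follows. The function and dictionary names are hypothetical. The example status value is taken from the sample log line in the Additional Information section of this article, and the decoded flags agree with the "Valid.Overflow.Err enabled." text in that line.

```python
# Bit positions of the single-bit flags in MCi_STATUS, as listed above.
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["other_info"] = (status >> 32) & 0x1FFFFFF        # bits 56:32
    decoded["extended_error_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    decoded["mca_error_code"] = status & 0xFFFF               # bits 15:0
    return decoded


d = decode_mci_status(0xD000008000310080)
print([name for name in FLAG_BITS if d[name]])  # ['VAL', 'OVER', 'EN']
print(hex(d["other_info"]))                     # 0x80
print(hex(d["extended_error_code"]))            # 0x31
print(hex(d["mca_error_code"]))                 # 0x80
```

The model-specific fields (bits 56:32 and 31:16) are returned as raw integers because their meaning depends on the processor model and bank.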

Machine-check architecture-defined error codes in the bank status (MCi_STATUS) register

The machine-check architecture defines several errors which may be present in any bank's status register, grouped into Simple and Compound error codes. Identify the pattern which matches the contents of the status register.

Simple Error Codes reflect a specific fault, exactly matching the contents of the status register:

  • 0000 0000 0000 0000 – No error has been reported to this bank.
  • 0000 0000 0000 0001 – Unclassified. This error has not been classified into the MCA error classes. The additional information section may have meaning.
  • 0000 0000 0000 0010 – Parity error in internal microcode ROM
  • 0000 0000 0000 0011 – The BINIT# from another processor caused this processor to enter machine-check.
  • 0000 0000 0000 0100 – Functional redundancy check (FRC) master/slave error.
  • 0000 0000 0000 0101 – Internal parity error.
  • 0000 0100 0000 0000 – Internal timer error.
  • 0000 01xx xxxx xxxx – Internal unclassified error. At least one x equals 1
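
A minimal sketch of matching bits 15:0 against the simple error codes listed above. The names are hypothetical and the descriptions are abbreviated from the list; a return value of None only means that no simple code matched, so the compound patterns should be checked next.

```python
from typing import Optional

# Exact-match simple error codes, abbreviated from the list above.
SIMPLE_CODES = {
    0x0000: "No error has been reported to this bank",
    0x0001: "Unclassified",
    0x0002: "Parity error in internal microcode ROM",
    0x0003: "Another processor caused this processor to enter machine-check",
    0x0004: "Functional redundancy check (FRC) master/slave error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}


def match_simple_code(mca_error_code: int) -> Optional[str]:
    """Return a description if bits 15:0 match a simple error code."""
    code = mca_error_code & 0xFFFF
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    # Pattern 0000 01xx xxxx xxxx; the all-zero x case (0x0400) was handled
    # above, so at least one x is set here.
    if (code & 0xFC00) == 0x0400:
        return "Internal unclassified error"
    return None


print(match_simple_code(0x0005))  # Internal parity error
print(match_simple_code(0x0401))  # Internal unclassified error
print(match_simple_code(0x0080))  # None
```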

Compound Error Codes follow a pattern, and define multiple aspects of the error with a single error number:

  • 000F 0000 0000 11LL – Generic cache hierarchy errors.
  • 000F 0000 0001 TTLL – TLB errors.
  • 000F 0000 1MMM CCCC – Memory controller errors (Intel-only).
  • 000F 0001 RRRR TTLL – Memory errors in the cache hierarchy.
  • 000F 1PPT RRRR IILL – Bus and interconnect errors.

Compound Error Code sub-fields define sections of a compound error code. Use these to populate the template defined by the compound error code:

  • Encoding of Transaction Type (TT) sub-field:

    • 00 – Instruction
    • 01 – Data
    • 10 – Generic
    • 11 – Reserved

  • Encoding of Memory Hierarchy Level (LL) sub-field:

    • 00 – Level 0
    • 01 – Level 1
    • 10 – Level 2
    • 11 – Generic

  • Encoding of memory transaction type (MMM) sub-field:

    • 000 – Generic undefined request
    • 001 – Memory read error
    • 010 – Memory write error.
    • 011 – Address or command error.
    • 100 – Memory scrubbing error.
    • 101-111 – Reserved.

  • Encoding of channel number (CCCC) sub-field:

    • 0000-1110 – Channel number.
    • 1111 – Channel not specified.

  • Encoding of Request (RRRR) sub-field:

    • 0000 – Generic error
    • 0001 – Generic read
    • 0010 – Generic write
    • 0011 – Data read
    • 0100 – Data write
    • 0101 – Instruction fetch
    • 0110 – Prefetch
    • 0111 – Evict
    • 1000 – Snoop (probe)

  • Encoding of Participation Processor (PP) sub-field:

    • 00 – Local node originated the request.
    • 01 – Local node responded to the request.
    • 10 – Local node observed error as third-party.
    • 11 – Generic

  • Encoding of Timeout (T) sub-field:

    • 0 – Request did not time out.
    • 1 – Request did time out.

  • Encoding of Memory/IO (II) sub-field:
    • 00 – Memory access
    • 01 – Reserved
    • 10 – I/O
    • 11 – Other
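
The compound templates and sub-field tables above can be combined into a small decoder. This is a sketch under stated assumptions: the function names are hypothetical, the F bit (bit 12) is not described in this article so it is reported as-is and ignored when matching, and request values that are not listed above are labelled "Not listed".

```python
TT = {0: "Instruction", 1: "Data", 2: "Generic", 3: "Reserved"}
LL = {0: "Level 0", 1: "Level 1", 2: "Level 2", 3: "Generic"}
MMM = {0: "Generic undefined request", 1: "Memory read error",
       2: "Memory write error", 3: "Address or command error",
       4: "Memory scrubbing error"}
RRRR = {0: "Generic error", 1: "Generic read", 2: "Generic write",
        3: "Data read", 4: "Data write", 5: "Instruction fetch",
        6: "Prefetch", 7: "Evict", 8: "Snoop (probe)"}
PP = {0: "Local node originated the request",
      1: "Local node responded to the request",
      2: "Local node observed error as third-party", 3: "Generic"}
II = {0: "Memory access", 1: "Reserved", 2: "I/O", 3: "Other"}


def bits(value, high, low):
    """Return bits high:low (inclusive) of value as an integer."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def decode_compound(mca_error_code):
    """Decode bits 15:0 as a compound error code, or return None."""
    code = mca_error_code & 0xFFFF
    result = {"class": None, "F": bits(code, 12, 12)}
    c = code & 0xEFFF  # ignore the F bit when matching the templates
    if (c & 0xF800) == 0x0800:    # 000F 1PPT RRRR IILL
        result["class"] = "Bus and interconnect error"
        result["PP"] = PP[bits(c, 10, 9)]
        result["T"] = bool(bits(c, 8, 8))  # True if the request timed out
        result["RRRR"] = RRRR.get(bits(c, 7, 4), "Not listed")
        result["II"] = II[bits(c, 3, 2)]
        result["LL"] = LL[bits(c, 1, 0)]
    elif (c & 0xFF00) == 0x0100:  # 000F 0001 RRRR TTLL
        result["class"] = "Memory error in the cache hierarchy"
        result["RRRR"] = RRRR.get(bits(c, 7, 4), "Not listed")
        result["TT"] = TT[bits(c, 3, 2)]
        result["LL"] = LL[bits(c, 1, 0)]
    elif (c & 0xFF80) == 0x0080:  # 000F 0000 1MMM CCCC
        result["class"] = "Memory controller error"
        result["MMM"] = MMM.get(bits(c, 6, 4), "Reserved")
        channel = bits(c, 3, 0)
        result["CCCC"] = ("Channel not specified" if channel == 0xF
                          else "Channel %d" % channel)
    elif (c & 0xFFF0) == 0x0010:  # 000F 0000 0001 TTLL
        result["class"] = "TLB error"
        result["TT"] = TT[bits(c, 3, 2)]
        result["LL"] = LL[bits(c, 1, 0)]
    elif (c & 0xFFFC) == 0x000C:  # 000F 0000 0000 11LL
        result["class"] = "Generic cache hierarchy error"
        result["LL"] = LL[bits(c, 1, 0)]
    else:
        return None
    return result


# 0x0080 is the MCA error code (bits 15:0) of the status value in the sample
# log line in the Additional Information section of this article.
print(decode_compound(0x0080))
# {'class': 'Memory controller error', 'F': 0, 'MMM': 'Generic undefined request', 'CCCC': 'Channel 0'}
```

The decoded result for 0x0080 agrees with the "Memory Controller Error on Channel 0" text in that sample log line.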

Model-specific error codes in the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers

The machine-check architecture allows for bits or groups of bits within the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers to take on additional meaning based on the processor model and the bank number. Listing the field meanings for all processor families is outside the scope of this article.

To interpret the additional contents of the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers, review the documentation for the specific processor model. For more information, see the vendor documentation listed in the Additional Information section of this article or contact the hardware vendor.

Other considerations

  • Information reported by the machine-check architecture provides aid in troubleshooting a hardware issue. However, the information available from the MCA error code may be insufficient to root-cause the issue. If more information is required, refer to the processor documentation from the manufacturer.

  • Information reported by the machine-check architecture should be considered in context of other errors when attempting to determine a pattern of outages.

  • If invalid information is reported by the machine-check architecture, but an MCE occurred, this is still reflective of a hardware fault.

  • Providing the full machine-check architecture register contents to the hardware vendor may assist their investigation into the cause of the hardware fault.

Additional Information

In some cases the host does not fail with a purple diagnostic screen, but the vmkernel or messages log reports MCE errors similar to:

MCE: 215: CMCI on cpu1 bank8: Status:0xd000008000310080 Misc:0x0 Addr:0x0: Valid.Overflow.Err enabled.
MCE: 220: Status bits: "Memory Controller Error on Channel 0.

More detailed information can be found in documentation by the CPU manufacturers:

Note: The preceding links were correct as of August 7, 2012. If you find the link is broken, provide feedback and a VMware employee will update the link.

Update History

02/19/2014 - Added ESXi 5.5 to products
