Recent CPUs from Intel and AMD implement a machine-check architecture that detects and reports hardware issues, including system bus errors, RAM (ECC and parity) errors, and other CPU errors. A set of model-specific registers (MSRs) is used to report these errors.
When a hardware error occurs, the global and bank-specific machine-check architecture status registers are populated with information about the cause and about whether the CPU can safely continue execution. In the case of a correctable error, ESX/ESXi reports the incident and register contents in the VMkernel logs. If an error is uncorrectable and the CPU cannot continue safely, ESX/ESXi halts with a purple diagnostic screen.
During a machine check exception (MCE), the contents of the machine-check architecture registers are logged. The messages appear on the purple diagnostic screen itself and are recorded in the log file within the VMkernel zdump file. For more information, see Extracting the log file after an ESX or ESXi host fails with a purple screen error (1006796). If serial-line logging is configured, the same messages are emitted on the serial port. For more information, see Enabling serial-line logging for an ESX and ESXi host (1003900).
Machine-Check Architecture registers
The global MCA status register (MCG_STATUS) reports whether an MCE is in progress, whether the instruction pointer pushed onto the stack can be used to reliably restart program execution, and whether that instruction pointer is directly associated with the error.
The global capabilities (MCG_CAP) register identifies the capabilities of the machine-check architecture of the processor. The lower 8 bits specify the number of hardware-unit error-reporting banks present in a particular processor. A bank of error-reporting registers is associated with a specific hardware unit (or group of hardware units), though the association is vendor- and model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.
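As a minimal sketch of the bank-count field described above, the following assumes the MCG_CAP value has already been obtained as an integer. The helper name bank_count and the sample value are hypothetical and chosen only for illustration.

```python
def bank_count(mcg_cap: int) -> int:
    """Return the number of error-reporting banks (MCG_CAP bits 7:0)."""
    return mcg_cap & 0xFF


# Hypothetical MCG_CAP value, not taken from any real processor.
example_mcg_cap = 0x1C09
print(bank_count(example_mcg_cap))  # prints 9
```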
Each error-reporting bank comprises several registers. Of primary interest during a machine check exception is the status register (MCi_STATUS) of the bank, which contains detailed information regarding the machine check exception. The address (MCi_ADDR) and miscellaneous (MCi_MISC) registers may provide additional information.
Identifying register contents
Different versions of ESX/ESXi log the machine-check architecture register contents using different formats. For more information, see Determining VMware Software Version and Build Number (392).
- ESX/ESXi 3.x:
The log message is verbose, consisting of three lines for each bank of interest. Each line contains the text MCE and starts with the physical CPU (C) that experienced the exception. The general status register is displayed on its own line. The status (s), address (a) and miscellaneous (m) registers from each bank of interest are displayed on their own line.
cpuC:xxxx)Alert: MCE: Machine Check Exception: General Status 000000000000000n
cpuC:xxxx)Alert: MCE: Machine Check Exception: Bank b, Status ssssssssssssssss
cpuC:xxxx)Alert: MCE: Machine Check Exception: Bank b, Addr aaaaaaaaaaaaaaaa
cpuC:xxxx)Alert: MCE: Machine Check Exception: Bank b, Misc mmmm
- ESX/ESXi 4.0:
The log message is terse, consisting of one line for each bank of interest. The general status register is displayed on its own line. Each bank line starts with the text MCE and contains the physical CPU number (C), the bank number (B), and the status (s), miscellaneous (m) and address (a) registers.
Machine Check Global status on cpuC: 0x000000000000000n
MCE on cpuC bankB: Status:0xssssssssssssssss Misc:0xmmmm Addr:0xaaaaaaaaaaaaaaaa
- ESX/ESXi 4.1 and ESXi 5.0:
The log message is terse, consisting of one line for each bank of interest. Each line starts with the text MC and contains the physical CPU number (c), the bank number (b), and the status (s), miscellaneous (m) and address (a) registers, followed by the value of the general status register (n).
MC:PCPUc B:b S:0xssssssssssssssss M:0xmmmm A:0xaaaaaaaaaaaaaaaa n
Regardless of the version of ESX/ESXi, these items of information should be available:
- Physical CPU number
- Global status register
- Bank number
- Bank status register
- Bank address register
- Bank miscellaneous register
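The items listed above can be pulled out of the log text mechanically. The following is a minimal sketch that handles only the ESX/ESXi 4.0 format shown earlier. The function name parse_mce_lines, the dictionary layout, and the sample register values are assumptions made for illustration and are not part of any VMware tool.

```python
import re

# Patterns modeled on the ESX/ESXi 4.0 sample lines in this article.
GLOBAL_LINE = re.compile(
    r"Machine Check Global status on cpu(?P<cpu>\d+): 0x(?P<mcg>[0-9a-fA-F]+)"
)
BANK_LINE = re.compile(
    r"MCE on cpu(?P<cpu>\d+) bank(?P<bank>\d+): "
    r"Status:0x(?P<status>[0-9a-fA-F]+) "
    r"Misc:0x(?P<misc>[0-9a-fA-F]+) "
    r"Addr:0x(?P<addr>[0-9a-fA-F]+)"
)


def parse_mce_lines(lines):
    """Collect the global status and per-bank registers from 4.0-style lines."""
    result = {"cpu": None, "global_status": None, "banks": []}
    for line in lines:
        match = GLOBAL_LINE.search(line)
        if match:
            result["cpu"] = int(match.group("cpu"))
            result["global_status"] = int(match.group("mcg"), 16)
            continue
        match = BANK_LINE.search(line)
        if match:
            result["banks"].append({
                "cpu": int(match.group("cpu")),
                "bank": int(match.group("bank")),
                "status": int(match.group("status"), 16),
                "misc": int(match.group("misc"), 16),
                "addr": int(match.group("addr"), 16),
            })
    return result


# Hypothetical log lines; the register values are invented for this example.
sample = [
    "Machine Check Global status on cpu2: 0x0000000000000005",
    "MCE on cpu2 bank4: Status:0xb200000000000135 Misc:0x0 Addr:0x0",
]
parsed = parse_mce_lines(sample)
print(parsed["global_status"], parsed["banks"][0]["bank"])
```

Lines in the 3.x or 4.1/5.0 formats would need their own patterns, following the same approach.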
VMware ESX/ESXi versions 4.0 and later attempt to interpret the contents of the status registers for display on the purple diagnostic screen. For example:
MCE on cpuC bankB: Status:0xssssssssssssssss Misc:0xmmmm Addr:0xaaaaaaaaaaaaaaaa: Valid.UC.Err enabled.Misc valid.Addr valid.
Hardware (Machine) Error: Cache Hierarchy:_Level_2_Instruction_Cache_Instruction_Fetch_Error. PCPUc in world xxxx:process
Note: Where the automatic interpretation and vendor interpretation disagree, the interpretation of the vendor should be taken as correct. The raw contents of the status registers are also available, so they can be manually reviewed.
Decoding the global MCA status (MCG_STATUS) register
The global status register is 64 bits wide, but only the low 3 bits have meaning. The high 61 bits are reserved. Convert the register value to binary to compare it against these bits:
- Bit 2: Machine Check In Progress (MCIP). Identifies whether a machine check is in progress, and whether further fields should be consulted.
- Bit 1: Error IP Valid (EIPV). Identifies whether the instruction pointer pushed onto the stack is directly related to the error.
- Bit 0: Restart IP Valid (RIPV). Identifies whether program execution can be reliably restarted at the instruction pointer pushed onto the stack.
For example, the global status register value "5" is equal to 0101 in binary. This translates to MCIP=1, EIPV=0, RIPV=1, which indicates that there is a machine check in progress, and the Restart IP is valid.
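The same decoding can be expressed as a short sketch. The function name decode_mcg_status is hypothetical; the bit positions are the ones listed above.

```python
def decode_mcg_status(value: int) -> dict:
    """Split the low 3 bits of MCG_STATUS into named flags."""
    return {
        "RIPV": bool(value & 0x1),  # bit 0: Restart IP Valid
        "EIPV": bool(value & 0x2),  # bit 1: Error IP Valid
        "MCIP": bool(value & 0x4),  # bit 2: Machine Check In Progress
    }


print(decode_mcg_status(0x5))
# prints {'RIPV': True, 'EIPV': False, 'MCIP': True}
```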
Overview of the bank status (MCi_STATUS) register
Each bank's MCi_STATUS register contains information related to a machine-check error. This information is meaningful, and is logged, only if the Valid flag (bit 63) is set. This register is 64 bits wide.
|63|62|61|60|59|58|57|56:32|31:16|15:0|
|VAL|OVER|UC|EN|MISCV|ADDRV|PCC|Other Information|Extended Error Code|MCA Error Code|
The high 7 bits (63:57) provide an overview of the processor state and indicate which of the other registers are meaningful:
- Bit 63: VAL. Indicates (when set) that this bank's status (MCi_STATUS) register is valid, and that further fields should be consulted.
- Bit 62: OVER. Indicates (when set) that a machine-check error occurred while the results of a previous error were still in the error-reporting register bank. This may indicate that ESX/ESXi has not processed the MCE promptly, or that multiple MCEs occurred very close together.
- Bit 61: UC. Indicates (when set) that the processor did not, or was not able to, correct the error condition. An ESX/ESXi host always generates a purple diagnostic screen when the processor indicates that the error condition was uncorrectable.
- Bit 60: EN. Indicates (when set) that the error was enabled by the associated EEj bit of the MCi_CTL register. This bit is generally 1.
- Bit 59: MISCV. Indicates (when set) that the associated miscellaneous register (MCi_MISC) for this bank is valid, and contains additional information regarding the error.
- Bit 58: ADDRV. Indicates (when set) that the associated address register (MCi_ADDR) for this bank is valid, and contains the memory address where the error occurred. The memory address may be physical or virtual, depending on the type of error encountered.
- Bit 57: PCC. Indicates (when set) that the state of the processor may have been corrupted by the error condition, and that it may not be possible to reliably resume software execution.
Bits 56:32 contain other information, which may be reserved, used for counters, or hold other information that is model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.
Bits 31:16 contain a model-specific extended error code. For more information, see the vendor documentation listed in the Additional Information section of this article.
Bits 15:0 contain the machine-check architecture-defined error code for the machine-check error condition detected. These error codes are the same for all processors that implement the machine-check architecture, though individual processor models may define additional nuance. For more information, see the vendor documentation listed in the Additional Information section of this article.
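The field layout described above can be sketched as follows. The function name decode_mci_status and the sample status value are hypothetical. The model-specific fields (bits 56:32 and 31:16) are returned as raw integers because their meaning depends on the processor.

```python
# Bit positions of the single-bit flags in MCi_STATUS, as listed above.
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its documented fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["other_info"] = (status >> 32) & 0x1FFFFFF       # bits 56:32 (25 bits)
    fields["extended_error_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    fields["mca_error_code"] = status & 0xFFFF               # bits 15:0
    return fields


# Hypothetical status value, invented for this example.
decoded = decode_mci_status(0xB200000000000135)
for name, value in decoded.items():
    print(name, value)
```

For this sample value, VAL, UC, EN and PCC are set, which corresponds to a valid, uncorrected error that may have corrupted processor state.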
Machine-check architecture-defined error codes in the bank status (MCi_STATUS) register
The machine-check architecture defines several errors which may be present in any bank's status register, grouped into Simple and Compound error codes. Identify the pattern which matches the contents of the status register.
Simple Error Codes reflect a specific fault, exactly matching the contents of the status register:
0000 0000 0000 0000 – No error has been reported to this bank.
0000 0000 0000 0001 – Unclassified. This error has not been classified into the MCA error classes. The additional information section may have meaning.
0000 0000 0000 0010 – Parity error in internal microcode ROM
0000 0000 0000 0011 – The BINT# from another processor caused this processor to enter machine-check.
0000 0000 0000 0100 – Functional redundancy check (FRC) master/slave error.
0000 0000 0000 0101 – Internal parity error.
0000 0100 0000 0000 – Internal timer error.
0000 01xx xxxx xxxx – Internal unclassified error. At least one x equals 1.
Compound Error Codes follow a pattern, and define multiple aspects of the error with a single error number:
000F 0000 0000 11LL – Generic cache hierarchy errors.
000F 0000 0001 TTLL – TLB errors.
000F 0000 1MMM CCCC – Memory controller errors (Intel-only).
000F 0001 RRRR TTLL – Memory errors in the cache hierarchy.
000F 1PPT RRRR IILL – Bus and interconnect errors.
Compound Error Code sub-fields define sections of a compound error code. Use these to populate the template defined by the compound error code:
- Encoding of Transaction Type (TT) sub-field:
00 – Instruction
01 – Data
10 – Generic
11 – Reserved
- Encoding of Memory Hierarchy Level (LL) sub-field:
00 – Level 0
01 – Level 1
10 – Level 2
11 – Generic
- Encoding of memory transaction type (MMM) sub-field:
000 – Generic undefined request
001 – Memory read error
010 – Memory write error.
011 – Address or command error.
100 – Memory scrubbing error.
101-111 – Reserved.
- Encoding of channel number (CCCC) sub-field:
0000-1110 – Channel number.
1111 – Channel not specified.
- Encoding of Request (RRRR) sub-field:
0000 – Generic error
0001 – Generic read
0010 – Generic write
0011 – Data read
0100 – Data write
0101 – Instruction fetch
0110 – Prefetch
0111 – Evict
1000 – Snoop (probe)
- Encoding of Participation Processor (PP) sub-field:
00 – Local node originated the request.
01 – Local node responded to the request.
10 – Local node observed error as third-party.
11 – Generic
- Encoding of Timeout (T) sub-field:
0 – Request did not time out.
1 – Request did time out.
- Encoding of Memory/IO (II) sub-field:
00 – Memory access
01 – Reserved
10 – I/O
11 – Other
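The pattern matching described in this section can be sketched in a few lines. The function decode_mca_error_code, the helper pick, and the output wording are assumptions made for illustration. The F bit (bit 12) is masked off and not interpreted, because this article does not define it, and RRRR values above 1000 are reported as undefined because they are not listed here.

```python
# Lookup tables copied from the sub-field encodings listed above.
TT = ["Instruction", "Data", "Generic", "Reserved"]
LL = ["Level 0", "Level 1", "Level 2", "Generic"]
MMM = ["Generic undefined request", "Memory read error", "Memory write error",
       "Address or command error", "Memory scrubbing error"]
RRRR = ["Generic error", "Generic read", "Generic write", "Data read",
        "Data write", "Instruction fetch", "Prefetch", "Evict", "Snoop (probe)"]
PP = ["Local node originated the request", "Local node responded to the request",
      "Local node observed error as third-party", "Generic"]
II = ["Memory access", "Reserved", "I/O", "Other"]
SIMPLE = {
    0x0000: "No error has been reported to this bank",
    0x0001: "Unclassified",
    0x0002: "Parity error in internal microcode ROM",
    0x0003: "BINT# from another processor caused machine-check entry",
    0x0004: "Functional redundancy check (FRC) master/slave error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}


def pick(table, index, fallback):
    """Return table[index], or fallback for encodings the table does not cover."""
    return table[index] if index < len(table) else fallback


def decode_mca_error_code(status: int) -> str:
    """Describe bits 15:0 of an MCi_STATUS value (illustrative helper)."""
    code = status & 0xFFFF
    if code in SIMPLE:
        return SIMPLE[code]
    if (code & 0xFC00) == 0x0400:        # 0000 01xx xxxx xxxx
        return "Internal unclassified error"
    if code & 0xE000:                    # bits 15:13 are 0 in every template
        return "Unrecognized error code"
    body = code & 0x0FFF                 # drop the F bit (bit 12)
    ll = LL[body & 0x3]
    tt = TT[(body >> 2) & 0x3]
    rrrr = pick(RRRR, (body >> 4) & 0xF, "Undefined request")
    if body & 0x800:                     # 1PPT RRRR IILL
        pp = PP[(body >> 9) & 0x3]
        t = "Request did time out" if body & 0x100 else "Request did not time out"
        ii = II[(body >> 2) & 0x3]
        return f"Bus and interconnect error: PP={pp}; T={t}; RRRR={rrrr}; II={ii}; LL={ll}"
    if (body & 0xF00) == 0x100:          # 0001 RRRR TTLL
        return f"Memory error in the cache hierarchy: RRRR={rrrr}; TT={tt}; LL={ll}"
    if (body & 0xF80) == 0x080:          # 0000 1MMM CCCC
        mmm = pick(MMM, (body >> 4) & 0x7, "Reserved")
        channel = body & 0xF
        cccc = "Channel not specified" if channel == 0xF else f"Channel {channel}"
        return f"Memory controller error: MMM={mmm}; CCCC={cccc}"
    if (body & 0xFF0) == 0x010:          # 0000 0001 TTLL
        return f"TLB error: TT={tt}; LL={ll}"
    if (body & 0xFFC) == 0x00C:          # 0000 0000 11LL
        return f"Generic cache hierarchy error: LL={ll}"
    return "Unrecognized error code"


# Hypothetical error code 0x0135 = 0000 0001 0011 0101.
print(decode_mca_error_code(0x0135))
```

The simple codes are checked first because they require an exact match, and the compound templates are then tested from the most significant distinguishing bit downward.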
Model-specific error codes in the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers
The machine-check architecture allows for bits or groups of bits within the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers to take on additional meaning based on the processor model and the bank number. Listing the field meanings for all processor families is outside the scope of this article.
To interpret the additional contents of the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers, review the documentation for the specific processor model. For more information, see the vendor documentation listed in the Additional Information section of this article or contact the hardware vendor.
- Information reported by the machine-check architecture helps in troubleshooting a hardware issue. However, the information available from the MCA error code may be insufficient to identify the root cause. If more information is required, refer to the processor documentation from the manufacturer.
- Information reported by the machine-check architecture should be considered in context of other errors when attempting to determine a pattern of outages.
- If the machine-check architecture reports invalid information but an MCE has occurred, the MCE still indicates a hardware fault.
- Providing the full machine-check architecture register contents to the hardware vendor may assist their investigation into the cause of the hardware fault.