ESXi hosts become unresponsive (ESXi 6.5/6.7, qlnativefc 2.1.96.0/3.1.31.0, HPE DL580 Gen9/Gen10)

Article ID: 323608

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article describes the symptoms, cause, available workaround, and related information for ESXi 6.5 and 6.7 hosts that become unresponsive on HPE DL580 Gen9/Gen10 servers using the qlnativefc HBA driver versions 2.1.96.0/3.1.31.0.

Symptoms:
  • Due to a potential server cache coherency issue on some server platforms using specific processor models, ESXi 6.5 and ESXi 6.7 hosts configured with Marvell FC adapters and running qlnativefc drivers become unresponsive.
  • Leading up to the condition, vmkernel logs indicate the following messages:

    2020-02-23T08:40:06.175Z cpu0:65987)qlnativefc: vmhba2(41:0.0): qlnativefcStatusEntry(1): Already returned command for status handle (0x10365).
    ...
    2020-02-23T08:43:54.530Z cpu15:66058)WARNING: ScsiPath: 8255: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T2:L88
    2020-02-23T08:43:54.531Z cpu15:66058)WARNING: ScsiPath: 8255: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T2:L12
    2020-02-23T08:43:54.531Z cpu15:66058)WARNING: ScsiPath: 8255: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T2:L13
    2020-02-23T08:43:54.532Z cpu15:66058)WARNING: ScsiPath: 8255: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T2:L14
    2020-02-23T08:43:54.532Z cpu15:66058)WARNING: ScsiPath: 8255: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T2:L15
    2020-02-23T08:43:54.532Z cpu15:66058)WARNING: ScsiPath: 8255: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T2:L16
    2020-02-23T08:43:54.533Z cpu15:66058)WARNING: ScsiPath: 8255: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T2:L17
    2020-02-23T08:43:54.630Z cpu15:66058)WARNING: ScsiPath: 8255: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T0:L66


Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values may vary depending on your environment.

Environment

VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7

Cause

The driver processes the response queue in the following manner:
1)    At initialization, all entries in the response queue are set to a pre-defined signature.
2)    When posting a completion to the response queue, the firmware overwrites the pre-defined signature and updates the producer index.
3)    The driver processes the response queue in two contexts:
a)    in the command submission path, to reduce interrupt latency, and
b)    in the interrupt context.
4)    While processing the response queue, the driver walks through the entries starting from the consumer index, checking for any entry that does not carry the pre-defined signature.
5)    When such an entry is found, the driver processes it and restores the pre-defined signature. The driver also posts the updated consumer index (based on the entries consumed).
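
The steps above can be sketched as a simplified model. This is illustrative only, not driver code; the signature value, queue depth, and function names are hypothetical:

```python
# Simplified model of the signature-based response queue processing
# described above. All names and values are illustrative.
SIGNATURE = 0xDEADBEEF   # hypothetical pre-defined signature value
QUEUE_DEPTH = 8

queue = [SIGNATURE] * QUEUE_DEPTH   # step 1: all entries pre-signed
producer = 0
consumer = 0

def firmware_post_completion(handle):
    """Step 2: the firmware overwrites the signature with the completion
    and advances the producer index."""
    global producer
    queue[producer % QUEUE_DEPTH] = handle
    producer += 1

def driver_process_queue():
    """Steps 4-5: walk from the consumer index, handle any entry whose
    signature has been overwritten, then restore the signature and post
    the updated consumer index."""
    global consumer
    completed = []
    while queue[consumer % QUEUE_DEPTH] != SIGNATURE:
        slot = consumer % QUEUE_DEPTH
        completed.append(queue[slot])   # process the completion
        queue[slot] = SIGNATURE         # restore the pre-defined signature
        consumer += 1                   # consume the entry
    return completed

# Example: the firmware posts two completions; the driver consumes both.
firmware_post_completion(0x10365)
firmware_post_completion(0x10366)
handles = driver_process_queue()
```

In normal operation the consumer index trails the producer index, and every consumed entry is re-stamped with the signature so it is never processed twice.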

Analysis of the logs indicates that the qlnativefc driver first encounters a stale completion on the response queue, identified by the following message:

    2020-02-23T08:40:06.175Z cpu0:65987)qlnativefc: vmhba2(41:0.0): qlnativefcStatusEntry(1): Already returned command for status handle (0x10365).

While the driver stops processing this entry based on other parameters, the consumer index is still advanced because an entry was consumed, allowing it to move past the producer index and triggering a queue-full condition.
As a result, subsequent commands and abort requests posted to the firmware never complete, because the firmware cannot find a free slot in which to post the completion.
Eventually, the accumulation of outstanding commands leads to the host becoming unresponsive.
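
The effect of the consumer index overtaking the producer can be modeled as follows. This is a hypothetical sketch of the firmware's free-slot check, assuming 16-bit ring indices that wrap; the real firmware logic is not public:

```python
QUEUE_DEPTH = 8

def firmware_has_free_slot(producer, consumer):
    """Hypothetical model: the firmware posts a completion only if the
    number of unconsumed entries is below the queue depth. Ring indices
    wrap, so the difference is taken modulo 2**16."""
    in_flight = (producer - consumer) % (1 << 16)
    return in_flight < QUEUE_DEPTH

# Normal case: the consumer index trails the producer index.
normal = firmware_has_free_slot(5, 3)        # 2 entries in flight -> free slot

# Failure case: a stale completion makes the driver advance the consumer
# index past the producer (consumer = producer + 1). The wrapped
# difference becomes a huge value, so the firmware believes the queue is
# full and can no longer post completions for new commands or aborts.
after_stale_entry = firmware_has_free_slot(5, 6)
```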

Observations
•    The issue was reported on the following server configurations:
    o    HPE DL580 Gen9
            Intel(R) Xeon(R) CPU E7-8891 v4 @ 2.80GHz, CPU packages 4, BIOS U17, 2.73
            Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz
            Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz, CPU packages 4, BIOS U17
            Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz, CPU packages 4, BIOS U17
            Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz, CPU packages 4, BIOS U17
    o    HPE DL580 Gen10
            Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz, CPU packages 4
•    The problem has been reported with various qlnativefc driver versions, starting with 2.1.96.0/3.1.31.0.
    o    While the problem was more visible with newer qlnativefc driver versions, some customers have reported the issue with older versions as well.
•    Marvell has not been able to reproduce the issue in house so far.
•    Multiple customers were provided with debug drivers to identify the cause of the stale entry in the response queue, but the required debug data was not collected (either the customers could not reproduce the issue with the debug driver, or they were not ready to try it).

Related finding

A similar issue with a matching signature was reported with Marvell FC drivers on a Linux OS on an HPE DL560 Gen9 server with an Intel E7-4830 v4 2.0GHz processor. In that case, the customer had the QPI setting configured as "cluster-on-die" in the server BIOS. HPE recommended that the customer update the system BIOS to v2.60 or later. More details are available in the HPE customer advisory.

The server running Linux OS continued to experience the failure with the same signature even with the updated system BIOS. 

Marvell was able to reproduce the failure on Linux OS in-house with the latest system BIOS. Based on data collected, Marvell has established the following:
•    Data from the Linux environment indicates a server cache coherency issue on the server platforms using the specific processor model.
•    The cache coherency issue causes the driver's update of the pre-defined signature on a processed response queue entry from CPU#1 (in the command submission path, step 3.a above) not to be flushed to memory for ~100 ms.
•    A subsequent read of the same entry from CPU#2 (in the interrupt context, step 3.b above) sees the entry without the pre-defined signature, causing the driver to treat the entry as valid and advance the consumer index.
•    This condition leads to the firmware detecting a queue-full condition.
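
The race between the two CPUs can be illustrated with a minimal model of delayed write visibility. This is a deliberately simplified sketch (a real coherency domain is far more complex); the entry names and the unflushed-store representation are assumptions for illustration:

```python
SIGNATURE = 0xDEADBEEF   # hypothetical pre-defined signature value

# CPU#1 (command submission path) has processed entry 0 and restored
# the signature, but per the Linux data the store is not flushed to
# memory for ~100 ms, so only CPU#1's unflushed store holds it.
memory = {"entry0": 0x10365}               # old completion handle still visible
cpu1_unflushed_store = {"entry0": SIGNATURE}  # restore not yet visible to others

def cpu2_read(addr):
    """CPU#2 (interrupt context) reads main memory and does not see
    CPU#1's unflushed store."""
    return memory[addr]

# CPU#2 sees a non-signature value, so it wrongly treats the already
# processed entry as a new, valid completion and advances the consumer
# index a second time for the same entry.
stale_view = cpu2_read("entry0")
looks_valid = stale_view != SIGNATURE
```

With coherent memory, CPU#2 would see the restored signature, stop walking the queue, and the consumer index would never overtake the producer.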

Marvell has shared the findings from the in-house repro and has engaged with HPE for further analysis from the server platform side on the specific coherency.

Resolution

Based on the data collected on Linux and the very similar signatures on ESXi, Marvell believes the core problem lies outside of the Marvell FC driver domain and is related to the specific platform/processors.

As such, there is currently no resolution that can be provided via an update to the FC driver. The eventual fix will need to come from the server platform OEM; please engage the server platform vendor.

Workaround:
A possible workaround is to prevent the driver from accessing the response queue from different CPUs. 

    On ESXi, this can be achieved on existing certified drivers by disabling the ZIO (Zero Interrupt Operation) mode in the qlnativefc driver.

    ZIO can be disabled using the following command:

    $ esxcfg-module -s "ql2xoperationmode=0" qlnativefc

Note: A server reboot is required for the setting to be effective.
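
After the reboot, the configured option string can be checked with `esxcfg-module -g`. These commands are ESXi host configuration commands and must be run from an ESXi shell:

```shell
# Disable ZIO mode in the qlnativefc driver (from the workaround above)
esxcfg-module -s "ql2xoperationmode=0" qlnativefc
# Verify the configured option string for the module after reboot
esxcfg-module -g qlnativefc
```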

Additional Information

Impact/Risks:
The ESXi host is marked as not responding for vCenter Server requests.
Fibre Channel storage connectivity is lost, which affects Fibre Channel-connected datastores.