ESXi PSOD in ReportLun path - exposed when the target goes on and off continuously

Article ID: 318441


Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
Due to a race condition, customers may experience a PSOD on ESXi hosts running NFNIC driver 4.0.0.63 or 4.0.0.65 during rapid FC topology changes.

VMkernel logs or /var/core/vmkernel-zdump.1.FRAG logs show similar behavior. The snippet below is from after one of the final link flaps that triggered the PSOD.

2021-03-01T06:06:21.756Z cpu38:2098030)nfnic: <1>: INFO: fnic_tport_exch_reset: 4464: Tport exch reset: target id: 14 tport->fcid: 0x0a1100
2021-03-01T06:06:21.756Z cpu38:2098030)nfnic: <1>: INFO: fnic_tport_cleanup_io: 3957: ABTS is pending
2021-03-01T06:06:21.756Z cpu38:2098030)nfnic: <1>: INFO: fnic_tport_cleanup_io: 3958: IOREQ 0x459b44800740:
port_id=1376064
start_time = 231565857095324
abort event = 0
requiredlen = 8208
Status = 12582979
Message = 25$
2021-03-01T06:06:23.338Z cpu15:2098187)nfnic: <1>: INFO: fnic_fcpio_icmnd_cmpl_handler: 1696: io_req: 0x459b44800740 sc: 0x430e46988950 tag: 0x750 CMD_FLAGS: 0xc00053 CMD_STATE: FNIC_IOREQ_ABTS_PENDING ABTS pending hdr status: FCPIO_ABORTED scsi_status:$
2021-03-01T06:06:23.338Z cpu15:2098187)nfnic: <1>: INFO: fnic_fcpio_itmf_cmpl_handler: 2173: fcpio hdr status: FCPIO_TIMEOUT
2021-03-01T06:06:23.338Z cpu15:2098187)nfnic: <1>: INFO: fnic_fcpio_itmf_cmpl_handler: 2227: io_req: 0x459b44800740 sc: 0x430e46988950 id: 0x750 CMD_FLAGS: 0xc00073 CMD_STATE: FNIC_IOREQ_ABTS_PENDINGhdr status: FCPIO_TIMEOUT ABTS cmpl received
2021-03-01T06:06:23.338Z cpu15:2098187)WARNING: nfnic: <1>: fnic_process_driverIO: 1517: tport wwpn: 0x50000975a8112233 fcid: 0x0a1100 hstatus: 1 dstatus: 0

**** PSOD occurs here due to a race condition related to removing a TPort with an outstanding IO/ABTS or REPORT LUNS request.

2021-03-01T06:06:23.396Z cpu15:2098187)World: 3015: PRDA 0x418043c00000 ss 0x0 ds 0x10b es 0x10b fs 0x10b gs 0x0
2021-03-01T06:06:23.396Z cpu15:2098187)World: 3017: TR 0xfd8 GDT 0x451b0aea1000 (0xfe7) IDT 0x41802c965000 (0xfff)
2021-03-01T06:06:23.396Z cpu15:2098187)World: 3018: CR0 0x80010031 CR3 0x806f41f000 CR4 0x142768
2021-03-01T06:06:23.426Z cpu15:2098187)Backtrace for current CPU #15, worldID=2098187, fp=0x430e468f6040
2021-03-01T06:06:23.426Z cpu15:2098187)0x451b16f1b8a0:[0x41802c90be95]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x0, 0x41802cc4d3b8, 0x451b16f1b948, 0x0, 0x1
2021-03-01T06:06:23.426Z cpu15:2098187)0x451b16f1b940:[0x41802c90c0c8]Panic_NoSave@vmkernel#nover+0x4d stack: 0x451b16f1b9a0, 0x451b16f1b960, 0x7fffffff7fffffff, 0xf, 0x20040b
2021-03-01T06:06:23.426Z cpu15:2098187)0x451b16f1b9a0:[0x41802c82dd16]LockCheckSelfDeadlockInt@vmkernel#nover+0x5b stack: 0xd29ce77330d2, 0x41802c9124e0, 0x1, 0x41802cb1b0eb, 0x451b10923780
2021-03-01T06:06:23.426Z cpu15:2098187)0x451b16f1b9c0:[0x41802c9124df]MCS_LockWait@vmkernel#nover+0x104 stack: 0x451b10923780, 0x417fecbfa768, 0x0, 0x923780, 0x41804bc00080
2021-03-01T06:06:23.426Z cpu15:2098187)0x451b16f1ba90:[0x41802c9129aa]MCSLockWithFlagsWork@vmkernel#nover+0x23 stack: 0x451b15aa3000, 0x1, 0x430e468d1290, 0x41802d21b464, 0x1
2021-03-01T06:06:23.426Z cpu15:2098187)0x451b16f1baa0:[0x41802c83d2f3]vmk_SpinlockLock@vmkernel#nover+0x18 stack: 0x430e468d1290, 0x41802d21b464, 0x1, 0x41802c891ddf, 0x41802c947720
2021-03-01T06:06:23.426Z cpu15:2098187)0x451b16f1bac0:[0x41802d21b463]fnic_fcpio_icmnd_cmpl_handler@(nfnic)#+0xdc stack: 0x41802c947720, 0x2, 0x0, 0x41802c8cd516, 0x418043c00000
2021-03-01T06:06:23.426Z cpu15:2098187)0x451b16f1bb90:[0x41802d21ee65]fnic_fcpio_cmpl_handler@(nfnic)#+0x16e stack: 0x41802c8f1051, 0x0, 0x430c18391fa0, 0x430e468d4860, 0xffffffff
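To confirm this signature, the vmkernel log can be searched for the tport reset and ABTS timeout messages shown above. The snippet below is an illustrative sketch: it builds a small sample file from two of the quoted lines and greps it; on a live ESXi host you would point LOG at /var/log/vmkernel.log instead.

```shell
# Sketch: count occurrences of the PSOD signature messages shown above.
# On a live ESXi host, set LOG=/var/log/vmkernel.log; the sample file
# here just reproduces two of the lines quoted in this article.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2021-03-01T06:06:21.756Z cpu38:2098030)nfnic: <1>: INFO: fnic_tport_exch_reset: 4464: Tport exch reset: target id: 14 tport->fcid: 0x0a1100
2021-03-01T06:06:23.338Z cpu15:2098187)nfnic: <1>: INFO: fnic_fcpio_itmf_cmpl_handler: 2173: fcpio hdr status: FCPIO_TIMEOUT
EOF
# -c counts matching lines; -E enables the alternation pattern
count=$(grep -cE 'fnic_tport_exch_reset|FCPIO_TIMEOUT' "$LOG")
echo "$count"
rm -f "$LOG"
```

A nonzero count alongside a backtrace through fnic_fcpio_icmnd_cmpl_handler (as above) suggests this issue rather than an unrelated PSOD.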

Environment

VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.x

Cause

This has been seen in environments where FC links go offline and come back online quickly, before certain FC failure-mitigation timeouts complete, while there is outstanding IO on the affected link.

This condition is rare.

This issue is present in NFNIC driver versions 4.0.0.59 through 4.0.0.65.
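To check whether a host falls in the affected range, a minimal sketch is shown below. The `nfnic_affected` helper is hypothetical (not part of ESXi), and it assumes the 4.0.0.NN version scheme used by this driver train; the affected range 4.0.0.59 through 4.0.0.65 is taken from this article. On a live host the installed version can be read with, for example, `esxcli software vib list | grep -i nfnic`.

```shell
# Hypothetical helper: return success when an nfnic version string falls
# in the affected range (4.0.0.59 through 4.0.0.65), assuming a 4.0.0.NN
# version scheme. Not an ESXi command; illustration only.
nfnic_affected() {
  case "$1" in
    4.0.0.*) ;;            # only the 4.0.0.x train is in scope here
    *) return 1 ;;
  esac
  nn=${1##*.}              # last version component, e.g. 63
  [ "$nn" -ge 59 ] && [ "$nn" -le 65 ]
}

nfnic_affected 4.0.0.63 && echo "4.0.0.63: affected" || echo "4.0.0.63: ok"
nfnic_affected 4.0.0.70 && echo "4.0.0.70: affected" || echo "4.0.0.70: ok"
```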

Resolution

This issue is corrected in the 4.0.0.70 NFNIC driver:
https://customerconnect.vmware.com/downloads/details?downloadGroup=DT-ESXI67-CISCO-NFNIC-40070&productId=742

Workaround:
Until the fixed driver can be installed, Cisco recommends addressing any underlying hardware conditions (SFPs, ports, etc.) that expose this condition.

Additional Information

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvx72162