VMware
 

Knowledge Base


ESX Server Host and Virtual Machines Not Responding after Clicking the Rescan Button to Scan for New Storage Devices

Details

Problem Description

While running ESX Server 3.0.0 or 3.0.1, some systems stop responding for two to five minutes during a rescan operation, and occasionally a system stops responding completely.

Symptoms and Elements of the Problem

The following symptoms occur on systems experiencing this problem.
  1. The system stops responding for a period of two to five minutes or there is a total lockup of the service console.

  2. During this period, the service console and virtual machines are not responsive.

  3. The ESX Server host usually recovers at the end of this time period, but some of the systems may not recover.

  4. Systems from multiple hardware vendors, with both Intel and AMD processors, can be affected.

  5. Affected systems have been found to meet the following criteria:


    1. USB modules loaded in the service console

    2. At least one local VMFS volume

    3. At least one SAN VMFS volume

    4. Rescan of all devices and all VMFS volumes at the same time

    5. Two or more configured devices that share an IRQ assignment. To check for this, perform the steps in the following section, "How to Tell if You Have Shared IRQ Assignments."

How to Tell if You Have Shared IRQ Assignments

To determine whether you may be affected by this problem, start by listing the interrupt line usage. In the service console, type:

cat /proc/vmware/interrupts

This lists the interrupt usage. The output looks similar to this example:
 
Vector    PCPU  0    PCPU  1
0x21:         163          0 COS irq 1 (ISA edge)
0x29:           0          0
0x31:           1          0 VMK serial
0x39:           0          0
0x41:           0          0
0x49:           0          0
0x51:           0          0
0x59:           1          0 COS irq 12 (ISA edge)
0x61:        1885          0 COS irq 14 (ISA edge)
0x69:           1          0 COS irq 15 (ISA edge)
0x71:          30          0 COS irq 19 (PCI level), VMK aic7xxx
0x79:           1      52596 <COS irq 17 (PCI level)>, VMK vmnic0
0x81:       66860          0 COS irq 16 (PCI level)
0x89:          42         46 <COS irq 18 (PCI level)>, VMK aic7xxx
0xdf:     3588590    3589262 VMK timer
0xe1:           0          0 VMK ipi
0xe9:           4          1 VMK resched
0xf1:           3          0 VMK tlb
0xf9:        2871          0 VMK noop
0xfc:           0          0 VMK thermal
0xfd:           0          0 VMK lint1
0xfe:           0          0 VMK error
0xff:           0          0 VMK spurious  
Examine the output for controllers with shared interrupts. Ignore devices in angle brackets; any line with both a VMK entry and a COS entry (without angle brackets) indicates a possible problem with a shared interrupt.

The example above contains a line at vector 0x71 with both VMK and COS devices, another line at vector 0x79 with both VMK and COS (in angle brackets), and a third line at vector 0x89 with both VMK and COS (in angle brackets). You can ignore the latter two lines with the COS devices in angle brackets, and focus on the line at vector 0x71.
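
The manual inspection described above can be approximated with a small filter. The following is a minimal sketch, not a VMware-supplied tool: the function name find_shared_vectors is hypothetical, and the filter assumes the output format matches the example shown. It prints only lines that contain a VMK entry together with a COS entry that is not in angle brackets.

```shell
# Hypothetical helper: list interrupt lines shared by the service console (COS)
# and the VMkernel (VMK). Lines whose COS entry is in angle brackets are skipped.
# $1: file to read; defaults to the live /proc node on the ESX Server host.
find_shared_vectors() {
    awk '/VMK/ && /COS irq/ && !/<COS irq/ { print }' "${1:-/proc/vmware/interrupts}"
}
```

Run against the example output, this prints only the line for vector 0x71.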

Next, list the PCI device assignments (VMkernel or service console) for the shared interrupt lines.

cat /proc/vmware/pci

This lists all PCI devices present in the machine. The output looks similar to this:
 

Bus:Sl.F Vend:Dvid Subv:Subd Type     Vendor   ISA/irq/Vec P M Module Name
                                               Spawned bus
000:00.0 8086:1a21 1028:0096 Host/PCI Intel                  C
000:01.0 8086:1a23 0000:0000 PCI/PCI  Intel        001       C
000:30.0 8086:2418 0000:0000 PCI/PCI  Intel        002       C
000:31.0 8086:2410 0000:0000 PCI/ISA  Intel                  C
000:31.1 8086:2411 8086:2411 IDE      Intel                  C
000:31.2 8086:2412 8086:2412 USB      Intel    11/ 19/0x71 D C
000:31.3 8086:2413 8086:2413 SMBus    Intel    11/ 17/0x79 B C
001:00.0 10de:0150 10de:002e Display  NVidia    9/ 16/0x81 A C
002:04.0 10b7:9200 1028:0096 Ethernet 3Com     16/ 16/0x81 A C
002:06.0 1013:6003 1028:0096 Audio    0x1013   10/ 18/0x89 A C
002:09.0 8086:1229 8086:000c Ethernet Intel    11/ 17/0x79 A V e100    vmnic0
002:14.0 1011:0024 0000:0000 PCI/PCI  DEC          003       C
003:10.0 9005:00cf 1028:0096 SCSI     Adaptec  10/ 18/0x89 A V aic7xxx vmhba0
003:10.1 9005:00cf 1028:0096 SCSI     Adaptec  11/ 19/0x71 B V aic7xxx vmhba1

Use the interrupt vectors of the shared interrupt lines to index into the PCI devices output, and determine the affected controllers. The interrupt vector is found in the ISA/irq/Vec column of the output.

A mode value of C indicates the device is dedicated to the service console. V indicates the device is dedicated to the VMkernel.

You are affected by the problem if you identify a group of controllers that share the same Vec number and both of these are true:

  • One controller has mode C (managed by the service console).
  • Another controller has mode V (managed by the VMkernel).

You are not affected by the problem if, for every group of controllers that share the same Vec number, one of these is true:

  • All controllers in the group have mode C (assigned to the service console).
  • All controllers in the group have mode V (assigned to the VMkernel).

To continue the example, the only interrupt vector of concern is 0x71, which is found in two rows of this output. The affected controllers are the Intel USB and the Adaptec SCSI controller, which share interrupt 19 and vector 0x71.

USB has mode C (meaning it is assigned to the service console) while vmhba1 has mode V (meaning it is assigned to the VMkernel). In this example, the controllers that share the interrupt line are managed by different entities. A system configured this way is likely to encounter the rescan problem.

Root Cause

The host stops responding during a rescan because of a deadlock between two requests coming from the service console. The VMkernel services these requests on behalf of the service console.

Solution

Workarounds

Depending on each site's situation and configuration, some of the following workarounds may be more practical than others.
  1. When you perform a rescan from the Virtual Infrastructure Client, the Rescan dialog box appears with two check boxes.

    Leaving both boxes checked can cause the rescan hang or the service console lockup. As a workaround, uncheck one of the boxes and click OK. Then perform a second rescan with only the other box checked to complete the full set of rescans.

  2. Rescan one host bus adapter (HBA) at a time, proceeding through all Fibre Channel HBAs, either manually or using a script containing a rescan command line. This is highly effective, but requires more administrative work. The command line to use is:

    esxcfg-rescan <vmhba#>

    For example, to rescan vmhba1, run the command line:

    esxcfg-rescan vmhba1

  3. Reconfigure the BIOS so that the USB and NIC, or the USB and HBA, do not share the same IRQ. A side effect of this workaround is that iLO (HP) and DRAC (Dell) are also disabled. RSA (IBM) might be affected by this action.

  4. Reconfigure the system so that the USB and NIC, or the USB and HBA, are not sharing the same IRQ. This can be accomplished through a number of means (changing the physical card location, removing unneeded cards, and so on).

  5. Disable the USB interface in the BIOS. A side effect of this workaround is that the remote management consoles, iLO (HP) and DRAC (Dell), as well as other USB devices, are also disabled. RSA (IBM) might be affected by this action.
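
Workaround 2 above mentions using a script that contains the rescan command line. A minimal sketch of such a script follows. The function name rescan_each_hba and the RESCAN_CMD override are hypothetical and exist only for illustration; the adapter names must be supplied by the administrator, and vmhba2 in the usage comment is a placeholder.

```shell
# Hypothetical helper: rescan the listed adapters one at a time rather than
# all at once. RESCAN_CMD can be overridden to dry-run the loop off-host;
# by default the loop calls esxcfg-rescan as described in workaround 2.
rescan_each_hba() {
    for hba in "$@"; do
        echo "Rescanning $hba"
        "${RESCAN_CMD:-esxcfg-rescan}" "$hba" || echo "Rescan of $hba failed" >&2
    done
}

# Example (adapter names are placeholders):
#   rescan_each_hba vmhba1 vmhba2
```

Each adapter is rescanned only after the previous rescan has returned, which matches the one-adapter-at-a-time approach of the workaround.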

Permanent Solution

The fix resolves a deadlock between two requests serviced by VMkernel on behalf of the console OS, which leads to the ESX Server host not responding. The fix is released in ESX 3.0.1 patch ESX-1000039. See http://kb.vmware.com/kb/1000039, or download the patch directly from http://www.vmware.com/support/vi3/doc/esx-1000039-patch.html.
