
Troubleshooting a virtual machine outage across multiple hosts connected to the same array (1003615)

Symptoms

An outage occurs across the entire ESX environment, and virtual machines on multiple hosts stop responding.

Resolution

To determine whether the outage is caused by a failure on the shared storage that the ESX hosts are attached to:
  1. Review the vmkernel log (/var/log/vmkernel).

    At the time of the outage, you see messages similar to:

    Dec  8 20:06:01 esx013 vmkernel: 29:16:44:10.325 cpu3:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 941463, handle 1472/0x40211a28
    Dec  8 20:06:01 esx013 vmkernel: 29:16:44:10.325 cpu3:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40211a28, originSN 941463 from vmhba0:0:5
    Dec  8 20:06:01 esx013 vmkernel: 29:16:44:10.325 cpu3:1032)<6>qla24xx_abort_command(0): handle to abort=857
    Dec  8 20:06:01 esx013 vmkernel: 29:16:44:10.326 cpu3:1032)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
    Dec  8 20:06:01 esx013 vmkernel: 29:16:44:10.326 cpu3:1032)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
    Dec  8 20:06:01 esx013 vmkernel: 29:16:44:10.326 cpu3:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 1073299, handle 2415/0x4020c038


    The first line indicates that an asynchronous I/O request timed out and the command was aborted. The lines that follow identify the path where it happened (vmhba0:0:5) and the handle involved. Further analysis shows that the same event occurred across many LUNs:

    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40211a28, originSN 941463 from vmhba0:0:5
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x4020c038, originSN 1073299 from vmhba2:0:18
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40206f58, originSN 15797823 from vmhba2:0:10
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40206f88, originSN 14484283 from vmhba2:0:8
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40207628, originSN 4918610 from vmhba2:0:4
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x4020a150, originSN 11539968 from vmhba0:0:21
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402139b8, originSN 5246133 from vmhba0:0:19
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402139e8, originSN 1514027 from vmhba2:0:0
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402127c8, originSN 27207202 from vmhba0:0:3
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x4020bfd8, originSN 6286068 from vmhba2:0:2
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402123a0, originSN 4414245 from vmhba0:0:17
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40206fb8, originSN 4404015 from vmhba0:0:11
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402075f8, originSN 516752 from vmhba2:0:20
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x4020f4c8, originSN 175909 from vmhba0:0:13
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40206f58, originSN 15797825 from vmhba2:0:10
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x4020c038, originSN 1073300 from vmhba2:0:18
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40211a28, originSN 941464 from vmhba0:0:5
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402139e8, originSN 1514028 from vmhba2:0:0
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402139b8, originSN 5246134 from vmhba0:0:19
    LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x4020a150, originSN 11539969 from vmhba0:0:21


    This shows that the ESX host is having difficulty communicating with many LUNs on target 0, through both vmhba0 and vmhba2.

  2. Check the vmkernel logs of the other ESX hosts attached to the same array for the same messages:

    Dec  8 20:05:58 esx015 vmkernel: 24:03:35:28.836 cpu6:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 777397, handle 1286/0x402119f8
    Dec  8 20:05:58 esx015 vmkernel: 24:03:35:28.836 cpu6:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402119f8, originSN 777397 from vmhba0:0:19
    Dec  8 20:05:58 esx015 vmkernel: 24:03:35:28.836 cpu6:1032)<6>qla24xx_abort_command(0): handle to abort=1149
    Dec  8 20:05:58 esx015 vmkernel: 24:03:35:28.837 cpu6:1032)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
    Dec  8 20:05:58 esx015 vmkernel: 24:03:35:28.837 cpu6:1032)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
    Dec  8 20:05:58 esx015 vmkernel: 24:03:35:28.837 cpu6:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 1599744, handle 1b82/0x4020a150
    Dec  8 20:05:58 esx015 vmkernel: 24:03:35:28.837 cpu6:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x4020a150, originSN 1599744 from vmhba0:0:7


    Dec  8 20:05:59 esx017 vmkernel: 23:11:16:24.526 cpu1:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 1403311, handle 148a/0x402119f8
    Dec  8 20:05:59 esx017 vmkernel: 23:11:16:24.526 cpu1:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402119f8, originSN 1403311 from vmhba0:0:15
    Dec  8 20:05:59 esx017 vmkernel: 23:11:16:24.526 cpu1:1032)<6>qla24xx_abort_command(0): handle to abort=1727
    Dec  8 20:05:59 esx017 vmkernel: 23:11:16:24.526 cpu1:1032)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
    Dec  8 20:05:59 esx017 vmkernel: 23:11:16:24.526 cpu1:1032)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
    Dec  8 20:05:59 esx017 vmkernel: 23:11:16:24.526 cpu1:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 1195109, handle 15b6/0x4020c038
    Dec  8 20:05:59 esx017 vmkernel: 23:11:16:24.526 cpu1:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x4020c038, originSN 1195109 from vmhba0:0:11

    Here, esx015 and esx017 logged the same timeouts within a few seconds of esx013 (20:05:58 and 20:05:59 versus 20:06:01). When an outage like this occurs on multiple ESX hosts at effectively the same time, you can conclude that the root cause is not ESX but the array, or possibly the fabric switches.

  3. Review the storage controller or storage processor logs on the array for any events or messages around the time the outage occurred. There are a number of reasons why this kind of outage occurs, such as a controller problem, a failing hard drive, or a SAN Copy operation being initiated. Also review the switch logs for the same time frame to see whether the switches were a factor in the outage.
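
The log review in steps 1 and 2 can also be scripted when many hosts are involved. The following is a minimal Python sketch, not a VMware tool: it assumes the vmkernel logs have been copied off the hosts and use the syslog line format shown above. The function names, the 10-second window, and the placeholder year (syslog lines carry no year) are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Matches syslog-prefixed abort lines in the format shown in this article, e.g.
# "Dec  8 20:06:01 esx013 vmkernel: ... Aborting cmds with world 1024, ... from vmhba0:0:5"
ABORT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmkernel:"
    r".*Aborting cmds with world \d+,.*from (?P<path>vmhba\d+:\d+:\d+)\s*$"
)

def parse_aborts(lines, year=2000):
    """Yield (timestamp, host, path) for every abort line.

    The year is a placeholder because syslog timestamps omit it.
    """
    for line in lines:
        m = ABORT_RE.match(line.strip())
        if m:
            stamp_text = " ".join(m["stamp"].split())  # collapse "Dec  8" to "Dec 8"
            stamp = datetime.strptime(f"{year} {stamp_text}", "%Y %b %d %H:%M:%S")
            yield stamp, m["host"], m["path"]

def summarize(events, window=timedelta(seconds=10)):
    """Count aborts per host and path, and test whether the first abort on
    each host falls within `window` of the others (assumed threshold)."""
    per_host_paths = defaultdict(Counter)
    first_seen = {}
    for stamp, host, path in events:
        per_host_paths[host][path] += 1
        if host not in first_seen or stamp < first_seen[host]:
            first_seen[host] = stamp
    spread = max(first_seen.values()) - min(first_seen.values()) if first_seen else None
    simultaneous = len(first_seen) > 1 and spread <= window
    return per_host_paths, first_seen, simultaneous

# Three lines taken from the log excerpts in this article.
sample = [
    "Dec  8 20:06:01 esx013 vmkernel: 29:16:44:10.325 cpu3:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x40211a28, originSN 941463 from vmhba0:0:5",
    "Dec  8 20:05:58 esx015 vmkernel: 24:03:35:28.836 cpu6:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402119f8, originSN 777397 from vmhba0:0:19",
    "Dec  8 20:05:59 esx017 vmkernel: 23:11:16:24.526 cpu1:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x402119f8, originSN 1403311 from vmhba0:0:15",
]

per_host_paths, first_seen, simultaneous = summarize(parse_aborts(sample))
for host in sorted(per_host_paths):
    print(host, first_seen[host].strftime("%b %d %H:%M:%S"), dict(per_host_paths[host]))
print("Aborts on multiple hosts within the window:", simultaneous)
```

If the script reports aborts on several hosts within the window, that is consistent with an array or fabric problem rather than a problem on a single ESX host, and the array and switch logs are the next place to look.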
 
Contact your hardware vendor for more information.
