Virtual machines freeze intermittently or become unresponsive under heavy I/O load


Article ID: 327867



Products

VMware

Issue/Introduction

Symptoms:

Running the ps -s | grep vm-name command on the ESXi host that runs the affected virtual machine returns output similar to:

4313969 vmm0: vm-name COSTOP NONE 0-63
4313971 vmm1:vm-name WAIT SCSI 0-63
4313972 4313957 vmx-vthread-5:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314204 4313957 vmx-vthread-6:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314205 4313957 vmx-vthread-7:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314206 4313957 vmx-vthread-8:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314210 4313957 vmx-mks:vm-name WAIT UPOL 0-63 /bin/vmx
4314212 4313957 vmx-svga:vm-name WAIT SEMA 0-63 /bin/vmx
4314214 4313957 vmx-vcpu-0:vm-name COSTOP NONE 0-63 /bin/vmx
4314215 4313957 vmx-vcpu-1:vm-name WAIT SCSI 0-63 /bin/vmx

Note: In this output, vmm1 is blocked on a SCSI call (WAIT SCSI) and vmm0 is co-stopped (COSTOP).
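As a minimal sketch of how to pick the relevant lines out of a longer listing, the following filter prints the first-column ID of every thread that is either co-stopped or waiting on SCSI. The function names sample_ps_output and blocked_worlds are hypothetical, and the sample lines are copied from the output above rather than read from a live host. The filter scans each line for the COSTOP and WAIT SCSI tokens instead of relying on fixed column positions, because the vmm and vmx lines above have different column counts.

```shell
#!/bin/sh
# Sample lines copied from the ps -s output shown above (not live data).
sample_ps_output() {
cat <<'EOF'
4313969 vmm0: vm-name COSTOP NONE 0-63
4313971 vmm1:vm-name WAIT SCSI 0-63
4314210 4313957 vmx-mks:vm-name WAIT UPOL 0-63 /bin/vmx
4314214 4313957 vmx-vcpu-0:vm-name COSTOP NONE 0-63 /bin/vmx
4314215 4313957 vmx-vcpu-1:vm-name WAIT SCSI 0-63 /bin/vmx
EOF
}

# Hypothetical helper: reads ps -s style lines on stdin and prints
# "<first-column-id> <state>" for threads that are co-stopped or
# waiting on SCSI. All other threads are skipped.
blocked_worlds() {
  awk '{
    state = ""
    for (i = 2; i < NF; i++) {
      if ($i == "COSTOP") state = "COSTOP"
      else if ($i == "WAIT" && $(i + 1) == "SCSI") state = "WAIT-SCSI"
    }
    if (state != "") print $1, state
  }'
}

sample_ps_output | blocked_worlds
```

On a host, the same filter could be fed from ps -s | grep vm-name in place of the sample function.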

You see the error:

Unable to connect to the MKS: Error connecting to /bin/vmx process.

Virtual machines are unreachable over the network.
Virtual machines may report an invalid state.
Virtual machines are unresponsive.

Cause

A virtual machine can become unresponsive when:

Quiesced snapshots are being taken, or a custom quiescing script is in use.
The ESXi host is under heavy I/O load.
Storage performance is degraded at the device, storage pool, or LUN level.
One of the Virtual Machine Monitor (VMM) threads is blocked on a VSCSI call and the other VMM threads are co-stopped, waiting for the blocked thread to make progress.
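The last condition can be checked mechanically. The sketch below is an illustration only: vmm_blocked_on_scsi and sample_vmm_lines are hypothetical names, and the sample lines are the two vmm lines from the Symptoms output. The check succeeds when at least one VMM thread is in WAIT SCSI and every other VMM thread is in COSTOP.

```shell
#!/bin/sh
# The two vmm lines from the Symptoms output (not live data).
sample_vmm_lines() {
cat <<'EOF'
4313969 vmm0: vm-name COSTOP NONE 0-63
4313971 vmm1:vm-name WAIT SCSI 0-63
EOF
}

# Hypothetical helper: exit status 0 when at least one vmm thread is
# waiting on SCSI and no vmm thread is in any state other than
# WAIT SCSI or COSTOP. Non-vmm lines are ignored.
vmm_blocked_on_scsi() {
  awk '
    {
      is_vmm = 0; st = "OTHER"
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^vmm[0-9]+:/) is_vmm = 1
        if ($i == "COSTOP") st = "COSTOP"
        if ($i == "WAIT" && i < NF && $(i + 1) == "SCSI") st = "SCSI"
      }
      if (!is_vmm) next
      if (st == "SCSI") scsi++
      else if (st != "COSTOP") other++
    }
    END { exit !(scsi >= 1 && other == 0) }
  '
}

if sample_vmm_lines | vmm_blocked_on_scsi; then
  echo "match: a VMM thread is blocked on SCSI and the rest are co-stopped"
else
  echo "no match"
fi
```

A match only indicates that the thread states look like the ones in this article. It does not identify which of the listed causes is responsible.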

Resolution

Caution: Ensure that no snapshot consolidation tasks or backups are running on the virtual machines during this time.

To recover the virtual machine from its locked-up state:

Run this command to find the process list for the virtual machine and check the cartel ID:

ps -s | grep vm-name

Note: Refer to the ps -s output shown in the Symptoms section of this Knowledge Base article.

Find the vmx-vcpu thread that is waiting on a SCSI event.

Note: The number in the second column of the output is the cartel ID.

Run kill -18 cartel-ID to send the SIGCONT signal to the cartel, which resumes the stopped process.
After completing these steps, the virtual machine may need to be reloaded. For more information, see Reloading a vmx file without removing the virtual machine from inventory (1026043).
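The lookup in these steps can be sketched as a dry run. The function names below are hypothetical, the sample lines are copied from the Symptoms output, and the fixed column positions assume that vm-name contains no spaces. The script only prints the kill command and does not send a signal, so the command can be reviewed before it is run by hand.

```shell
#!/bin/sh
# vmx lines copied from the Symptoms output (not live data).
sample_ps_lines() {
cat <<'EOF'
4314212 4313957 vmx-svga:vm-name WAIT SEMA 0-63 /bin/vmx
4314214 4313957 vmx-vcpu-0:vm-name COSTOP NONE 0-63 /bin/vmx
4314215 4313957 vmx-vcpu-1:vm-name WAIT SCSI 0-63 /bin/vmx
EOF
}

# Hypothetical helper: prints the cartel ID (second column) of the first
# vmx-vcpu thread that is waiting on SCSI.
cartel_id_for_scsi_wait() {
  awk '$3 ~ /^vmx-vcpu-[0-9]+:/ && $4 == "WAIT" && $5 == "SCSI" { print $2; exit }'
}

# Hypothetical helper: prints the command from the steps above without
# running it. Prints nothing when no matching thread is found.
recovery_command() {
  cid="$(cartel_id_for_scsi_wait)"
  if [ -n "$cid" ]; then
    echo "kill -18 $cid"
  fi
}

sample_ps_lines | recovery_command
```

With the sample lines, the script prints kill -18 4313957. On a host, the input would come from ps -s | grep vm-name instead of the sample function.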

Notes:

The steps above are a workaround to recover the virtual machine from its locked-up state.
For more information on SIGCONT, see Sending signal to Processes.

Note: The preceding link was correct as of February 9, 2021. If you find that the link is broken, provide feedback and a VMware employee will update the link.

Additional Information

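Japanese: Virtual machines freeze intermittently under high I/O load (I/O 負荷の高い状況で仮想マシンが断続的にフリーズする)
Simplified Chinese: Virtual machines freeze intermittently under heavy I/O load (虚拟机在繁重的 I/O 负载下间歇性冻结)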
