Knowledge Base

Virtual machine CPU usage spikes and remains abnormally high after VMotion in a VMware DRS enabled cluster

Details

In a cluster with VMware Distributed Resource Scheduler (DRS) enabled, the CPU usage of a virtual machine may increase significantly after VMotion migrates it. As a result, the performance of the virtual machine may be degraded.
 
Note: This issue is resolved as of VirtualCenter 2.5.0 Update 2.

Solution

Starting with ESX Server 3.5 and VirtualCenter 2.5, VMware DRS applies a cap to the overhead memory of each virtual machine to control the rate at which this memory grows. After VMotion migrates a virtual machine, the cap is reset to a value computed specifically for that virtual machine. If the virtual machine monitor then indicates that the virtual machine requires more overhead memory, VMware DRS raises the cap at a controlled rate (1 MB per minute by default) until the overhead memory reaches a steady state, provided that sufficient resources are available on the host.

In VirtualCenter 2.5, this cap is not raised to satisfy the virtual machine's steady-state demand as expected. The virtual machine therefore operates with less overhead memory than it requires, which in turn may lead to higher observed virtual machine CPU usage and lower virtual machine performance in a VMware DRS-enabled cluster.
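The expected behavior described above can be illustrated with a small calculation. This is only a sketch: the function names and the example figures (100 MB initial cap, 130 MB demand) are hypothetical, and the model assumes the host always has sufficient resources, which the article states as a precondition.

```python
def minutes_to_steady_state(initial_cap_mb, demand_mb, rate_mb_per_min=1.0):
    """Minutes until a cap raised at a fixed rate reaches the demanded overhead.

    rate_mb_per_min defaults to 1 MB per minute, the default rate named in
    the article. All other figures are illustrative.
    """
    if rate_mb_per_min <= 0:
        raise ValueError("rate must be positive")
    shortfall = max(0.0, demand_mb - initial_cap_mb)
    return shortfall / rate_mb_per_min


def cap_after(initial_cap_mb, demand_mb, minutes, rate_mb_per_min=1.0):
    """Cap value after `minutes`; growth stops once demand is satisfied."""
    return min(demand_mb, initial_cap_mb + rate_mb_per_min * minutes)


# Hypothetical example: the cap is reset to 100 MB after migration,
# while the virtual machine needs 130 MB of overhead memory.
print(minutes_to_steady_state(100, 130))  # 30.0
print(cap_after(100, 130, 10))            # 110.0
```

With the defect described in this article, the cap stays at its reset value instead of following this curve, so the shortfall is never closed.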

Diagnosing the Issue

To diagnose the issue:

  1. Log in to VirtualCenter with Virtual Infrastructure Client as an administrator.
  2. Right-click your cluster from the inventory.
  3. Click Edit Settings.
  4. Disable VMware DRS.
  5. Click OK and wait for 1 minute.
  6. In the Virtual Infrastructure Client, note the virtual machine's CPU usage from the Performance tab and the virtual machine's memory overhead from the Summary tab.
  7. Right-click your cluster from the inventory.
  8. Click Edit Settings.
  9. Re-enable VMware DRS.
  10. Use VMotion to migrate a problematic virtual machine to another host.
  11. Note the virtual machine CPU usage and memory overhead on the new host.
  12. Disable VMware DRS on the cluster again, as in steps 2 through 5, and wait for 1 minute.
  13. Note the virtual machine CPU usage and memory overhead on the new host.

If the CPU usage of the virtual machine is higher in step 11 than in step 6, and in step 13 it returns to approximately the level seen in step 6 while the overhead memory observably increases, the virtual machine is affected by the issue discussed in this article.
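The comparison of the three readings can be written down as a simple check. This is a minimal sketch, not a VMware tool: the function name, the argument names, and the 10% tolerance used to decide what counts as "back to the original state" are all assumptions chosen for illustration.

```python
def matches_issue(cpu_step6, cpu_step11, cpu_step13,
                  overhead_step11_mb, overhead_step13_mb,
                  tolerance=0.10):
    """Return True if the readings follow the pattern described above.

    cpu_* are CPU usage readings (any consistent unit) noted in steps 6,
    11 and 13. overhead_* are memory overhead readings in MB from steps
    11 and 13. `tolerance` is a hypothetical margin, not a VMware value.
    """
    baseline_limit = cpu_step6 * (1 + tolerance)
    rose_after_vmotion = cpu_step11 > baseline_limit
    recovered_without_drs = cpu_step13 <= baseline_limit
    overhead_grew = overhead_step13_mb > overhead_step11_mb
    return rose_after_vmotion and recovered_without_drs and overhead_grew


# Hypothetical readings: CPU jumps from 20% to 80% after VMotion, then
# falls to 21% once DRS is disabled and overhead grows from 90 to 120 MB.
print(matches_issue(20, 80, 21, 90, 120))  # True
```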

Disabling DRS is required only for this diagnosis. You do not need to disable DRS to work around the issue.

Working around the issue prior to VirtualCenter 2.5 Update 1

To work around this issue:
  1. Log in to VirtualCenter with Virtual Infrastructure Client as an administrator.
  2. Right-click your cluster from the inventory.
  3. Click Edit Settings.
  4. Ensure that VMware DRS is shown as enabled. If it is not enabled, select the check box to enable VMware DRS.
  5. Click OK.
  6. Click an ESX Server from the Inventory.
  7. Click the Configuration tab.
  8. Click Advanced Settings.
  9. Click the Mem option.
  10. Locate the Mem.VMOverheadGrowthLimit parameter.
  11. Change the value of this parameter to 5 and click OK.

    Note: The default value of this setting is -1.


To fix multiple ESX Server hosts

If this parameter needs to be changed on several hosts (or if the workaround fails for an individual host), use the following procedure instead of changing every server individually:
  1. Log on to the VirtualCenter Server Console as an administrator.
  2. Make a backup copy of the vpxd.cfg file (typically located at C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\vpxd.cfg).
  3. In the vpxd.cfg file, add the following configuration between the <vpxd> and the </vpxd> tags:

    <cluster>
      <VMOverheadGrowthLimit>5</VMOverheadGrowthLimit>
    </cluster>

    This configuration provides an initial growth margin, in MB, for virtual machine overhead memory. You can increase this value if doing so further improves virtual machine performance.
  4. Restart the VMware VirtualCenter Server Service.

    Note: When you restart the VMware VirtualCenter Server Service, the new value for the overhead limit should be pushed down to all the clusters in VirtualCenter.
If the new values are not pushed down to the ESX hosts within 10 minutes:
  1. Log in to VirtualCenter with Virtual Infrastructure Client as an administrator.
  2. Right-click your cluster from the inventory.
  3. Click Edit Settings.
  4. Disable VMware DRS.
  5. Click OK. Wait for the DRS-disable task to complete.
  6. Right-click your cluster from the inventory.
  7. Click Edit Settings.
  8. Enable VMware DRS.
  9. Click OK.
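For the multiple-host procedure above, the vpxd.cfg edit in step 3 can also be scripted. The following is a sketch under stated assumptions: the function name is hypothetical, and the code assumes only that the file is well-formed XML containing a `<vpxd>` element somewhere in it (the article does not describe the rest of the file). Parsing and re-serializing with Python's standard `xml.etree.ElementTree` drops XML comments and may change whitespace, so make the backup from step 2 first.

```python
import xml.etree.ElementTree as ET


def set_overhead_growth_limit(cfg_text, limit_mb=5):
    """Return cfg_text with <cluster><VMOverheadGrowthLimit> set under <vpxd>.

    Existing <cluster> and <VMOverheadGrowthLimit> elements are reused
    rather than duplicated, so the function can be run more than once.
    """
    root = ET.fromstring(cfg_text)
    # iter() also yields the root itself, so <vpxd> may be the root
    # element or any descendant of it.
    vpxd = next(root.iter("vpxd"), None)
    if vpxd is None:
        raise ValueError("no <vpxd> element found")
    cluster = vpxd.find("cluster")
    if cluster is None:
        cluster = ET.SubElement(vpxd, "cluster")
    limit = cluster.find("VMOverheadGrowthLimit")
    if limit is None:
        limit = ET.SubElement(cluster, "VMOverheadGrowthLimit")
    limit.text = str(limit_mb)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily simplified file content for illustration only.
sample = "<config><vpxd><other/></vpxd></config>"
print(set_overhead_growth_limit(sample))
```

After writing the result back to vpxd.cfg, the VMware VirtualCenter Server Service still has to be restarted as described in step 4.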

Working around the issue if it persists after upgrading to VirtualCenter 2.5 Update 1

This behavior has been reported to persist under certain circumstances even after VirtualCenter 2.5 Update 1 is applied.
 
Note: The steps in the previous section also work, but this method is easier to implement and applies to any ESX host that is added to the DRS cluster.

To work around the issue:
  1. Log in to VirtualCenter with Virtual Infrastructure Client as an administrator.
  2. Right-click your cluster from the inventory.
  3. Click Edit Settings.
  4. Select VMware DRS. If it is not enabled, enable it.
  5. Click the Advanced Options button.
  6. Add MemOverheadGrowth with a value of 4.
  7. Click OK to close out of Advanced Options.
  8. Click OK to close out of the cluster configuration.

A permanent fix for this behavior is included in VirtualCenter 2.5 Update 2.

Verifying the workaround

To verify the setting has taken effect:
  1. Log in to your ESX Server service console as root, either through an SSH session or directly at the console of the server.
  2. Type less /var/log/vmkernel .
If the setting was changed successfully, a message similar to the following is displayed and no further action is required:
vmkernel: 1:16:23:57.956 cpu3:1036)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)
 
 
If changing the setting was unsuccessful, a message similar to the following is displayed:
vmkernel: 1:08:05:22.537 cpu2:1036)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
 
 
Note: If you see a message changing the limit to 5 followed by a message changing it back to -1, the fix was not applied successfully.

If the fix is unsuccessful, try the following:
  1. Create a new cluster and move the ESX Server hosts to this cluster.
  2. Check to see if the fix has been implemented successfully.
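The log check described in this section can be automated by scanning the vmkernel log for the most recent VMOverheadGrowthLimit message. This is a sketch only: the function names are hypothetical, and the regular expression is written against the two sample messages shown above, so other log formats may not match.

```python
import re

# Matches e.g.: "VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)
PATTERN = re.compile(
    r'VMOverheadGrowthLimit"\s*=\s*(-?\d+),\s*Old Value:\s*(-?\d+)'
)


def last_growth_limit(log_lines):
    """Return the most recently logged value, or None if none was logged."""
    value = None
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            value = int(match.group(1))
    return value


def workaround_applied(log_lines, expected=5):
    """True only if the last logged change set the limit to `expected`.

    A change to 5 followed by a change back to -1 therefore counts as
    not applied, matching the note above.
    """
    return last_growth_limit(log_lines) == expected


good = ['vmkernel: 1:16:23:57.956 cpu3:1036)Config: 414: '
        '"VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)']
bad = ['vmkernel: 1:08:05:22.537 cpu2:1036)Config: 414: '
       '"VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)']
print(workaround_applied(good))  # True
print(workaround_applied(bad))   # False
```

On an ESX Server service console, the functions accept an open file directly, for example `workaround_applied(open("/var/log/vmkernel"))`, assuming a Python interpreter is available there.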
