SSD log buildup can cause poor performance in a VMware vSAN Cluster (2141386)

Symptoms

In rare circumstances, the SSD/cache-tier logging space in vSAN (formerly known as Virtual SAN) can fill up. When this occurs, cluster performance degrades because the SSD cannot buffer inbound IO in a timely manner.

If this issue is encountered, you experience one or more of these symptoms:

  • Hosts periodically enter a Not Responding state in vCenter Server
  • Some virtual machines resident on vSAN exhibit extremely poor performance
  • Some virtual machines resident on vSAN may fail to power on due to timeout or IO error

This issue can manifest in several ways. It is most commonly (though not exclusively) associated with messages reporting persistently high SSD congestion or frequent oscillations in SSD congestion. These messages are logged in the vmkernel.log file on the ESXi vSAN host(s).

Note: SSD congestion messages are not necessarily associated with the issue described in this document. Their presence alone does not confirm that this issue has been encountered.

  • Persistently high SSD congestion messaging

    2015-10-21T07:05:09.294Z cpu5:33450)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node esxi-01.corp.local maximum SSD 52648428-cfb4-393f-b3cc-d3850b6d0eee congestion reached.
    2015-10-21T07:06:09.408Z cpu14:32817)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node esxi-01.corp.local maximum SSD 52648428-cfb4-393f-b3cc-d3850b6d0eee congestion reached.
    2015-10-21T07:07:09.491Z cpu13:33200)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node esxi-01.corp.local maximum SSD 52648428-cfb4-393f-b3cc-d3850b6d0eee congestion reached.

  • Oscillating SSD congestion messaging

    2015-10-20T05:55:15.773Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
    2015-10-20T05:55:15.775Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
    2015-10-20T05:55:15.776Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
    2015-10-20T05:55:15.813Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
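
The two message patterns above can be told apart programmatically. The following is a minimal Python sketch, not a VMware tool. The regular expressions are derived only from the sample lines quoted in this article, and the function name summarize_congestion and the idea of counting state transitions are illustrative assumptions. It only tallies messages; it does not diagnose the log-leak issue.

```python
import re
from collections import Counter

# Patterns inferred from the sample vmkernel.log lines quoted in this article.
# Real log lines may vary between releases; treat these as assumptions.
MAX_CONGESTION = re.compile(
    r"^(?P<ts>\S+) .*LSOM_ThrowCongestionVOB:\d+: Throttled: .* maximum SSD "
    r"(?P<uuid>[0-9a-f-]+) congestion reached"
)
ASYNC_STATE = re.compile(
    r"^(?P<ts>\S+) .*LSOM_ThrowAsyncCongestionVOB:\d+: "
    r"LSOM SSD Congestion State: (?P<state>\w+)\. "
    r"Congestion Threshold: (?P<threshold>\d+) "
    r"Current Congestion: (?P<current>\d+)"
)


def summarize_congestion(lines):
    """Count max-congestion messages per SSD UUID and Normal/Exceeded flips."""
    max_hits = Counter()   # SSD UUID -> number of "maximum ... reached" messages
    transitions = 0        # number of changes between reported congestion states
    last_state = None
    for raw in lines:
        line = raw.strip()
        match = MAX_CONGESTION.search(line)
        if match:
            max_hits[match.group("uuid")] += 1
            continue
        match = ASYNC_STATE.search(line)
        if match:
            state = match.group("state")
            if last_state is not None and state != last_state:
                transitions += 1
            last_state = state
    return {
        "max_congestion_by_ssd": dict(max_hits),
        "state_transitions": transitions,
    }
```

For example, the lines of a copied vmkernel.log could be passed in with summarize_congestion(open("vmkernel.log")). Many repeated maximum-congestion messages for one SSD UUID, or a high transition count, match the patterns shown above but, as noted, do not by themselves confirm this issue.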


Cause

This issue occurs due to a buildup of data in the write buffer of a disk group. This results in buffer exhaustion, which ultimately degrades IO performance.
 
To prevent exhaustion of the vSAN write buffer of each disk group, the system gradually throttles the rate of write operations as free buffer space decreases. It does this by injecting progressively higher latencies into the processing of workload IO operations. An adaptive algorithm prevents overreaction to transient workload spikes by increasing the synthetic delay slowly as the buffer continues to fill. Ultimately, the algorithm ensures that the rate of incoming write operations can be matched by the rate of de-staging data from the buffer to the capacity tier.
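
The feedback loop described above can be illustrated with a toy model. The following Python sketch is not vSAN's actual algorithm: the function simulate_throttle, the linear ramp, the start threshold, the max_step rate limit, and all numeric values are hypothetical choices made only to show how slowly increasing throttling can bring the admitted write rate down to the de-staging rate.

```python
def simulate_throttle(ticks, offered, destage,
                      capacity=1000.0, start=0.3, max_step=0.05):
    """Toy write-buffer model (all parameters are illustrative assumptions).

    offered  -- writes the workload tries to issue per tick
    destage  -- writes drained from the buffer to the capacity tier per tick
    start    -- buffer fill fraction above which throttling begins
    max_step -- largest change in throttle per tick (slow reaction to spikes)

    Returns a list of (fill_fraction, throttle, admitted) tuples, one per tick.
    """
    used = 0.0       # buffer space currently occupied
    throttle = 0.0   # fraction of offered writes held back, in [0, 1]
    history = []
    for _ in range(ticks):
        fill = used / capacity
        # Target throttle ramps linearly from 0 at `start` to 1 at a full buffer.
        if fill <= start:
            target = 0.0
        else:
            target = min(1.0, (fill - start) / (1.0 - start))
        # Move toward the target gradually rather than jumping to it.
        throttle += max(-max_step, min(max_step, target - throttle))
        admitted = offered * (1.0 - throttle)
        used = min(capacity, max(0.0, used + admitted - destage))
        history.append((used / capacity, throttle, admitted))
    return history
```

In this model, a workload offering 100 writes per tick against a de-staging rate of 60 settles at a point where roughly 60 writes per tick are admitted and the buffer stops growing, which mirrors the balance the paragraph above describes.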

Under normal circumstances, this mechanism is effective in avoiding buffer exhaustion even for the most write-intensive workloads. However, when the log leak issue is encountered, a number of log records remain in the log (not de-staged) and thus undermine the effectiveness of the algorithm. As available buffer space is exhausted, the algorithm applies permanent, aggressive throttling to inbound workloads for the affected disk groups and their dependent objects. This permanent throttling causes the extreme performance degradation that is observed.
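
To illustrate why records that are never de-staged lead to permanent throttling, the following Python sketch models a buffer in which a small number of records per tick "leak". This is a simplified, hypothetical model, not vSAN internals: the function simulate_with_leak, the linear throttle ramp, the leak_per_tick parameter, and all numeric values are assumptions chosen for illustration.

```python
def simulate_with_leak(ticks, offered, destage, leak_per_tick,
                       capacity=1000.0, start=0.3):
    """Toy model of a write buffer where some records are never de-staged.

    leak_per_tick -- records per tick that stay in the buffer forever
                     (hypothetical stand-in for the log leak)
    Returns the throttle value (0 = none, 1 = all writes held back) per tick.
    """
    leaked = 0.0   # records that de-staging can never remove
    live = 0.0     # records that de-staging can still drain
    throttle_history = []
    for _ in range(ticks):
        fill = (leaked + live) / capacity
        # Throttle ramps linearly from 0 at `start` to 1 at a full buffer.
        if fill <= start:
            throttle = 0.0
        else:
            throttle = min(1.0, (fill - start) / (1.0 - start))
        # Admit throttled writes, but never more than the free space allows.
        admitted = min(offered * (1.0 - throttle), capacity - leaked - live)
        newly_leaked = min(leak_per_tick, admitted)
        leaked += newly_leaked
        # Only non-leaked records can be drained to the capacity tier.
        live = max(0.0, live + (admitted - newly_leaked) - destage)
        throttle_history.append(throttle)
    return throttle_history
```

With de-staging faster than the offered load (for example, offered=50 and destage=60), the model never throttles when leak_per_tick is 0. With even a small leak, the leaked records slowly consume the buffer and the throttle climbs toward 1 and stays there, because de-staging can no longer free the occupied space.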

Resolution

Note: The symptoms described above can also be caused by other issues. Contact VMware Support to confirm that the log-leak issue is the cause.

This issue is resolved in:

If you suspect that you are affected by this issue, engage VMware Support to confirm the behavior and formulate an action plan. Final identification of this issue can be complex, and the action plan for resolution varies with circumstances. Therefore, VMware recommends a formal support engagement.


Update History

01/14/2016 - Added ESXi 6.0 Update 1b fix details.
