etcd v3.5.0-3.5.2 can corrupt data in TKG v1.5.0-1.5.3

Article ID: 331349

Updated On:

Products

VMware

Issue/Introduction

Symptoms:
  • This issue is not common. It may come up when high memory load on control plane nodes causes one of them to shut down.
  • To diagnose this issue, run the following:
kubectl get pods -n kube-system -l component=etcd -o name | xargs -I '{}' -- kubectl -n kube-system exec '{}' -- etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key endpoint status -w json

In the JSON output, compare the values of revision and dbSize across the etcd members. If the cluster is more than one hour old and these values differ between members by more than 10%, then the cluster may be affected.
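
The comparison can be scripted. The following is a minimal sketch, not a VMware-provided tool. It assumes that the command prints one JSON array per etcd pod, that each entry exposes Endpoint, Status.header.revision and Status.dbSize, and that "differ by more than 10%" means the gap between the largest and smallest value across members, relative to the largest. The function names and the sample data are hypothetical.

```python
import json

def parse_endpoint_status(text):
    """Collect (endpoint, revision, dbSize) tuples from the command output.

    Assumes one JSON array per line, as produced when etcdctl runs once
    per etcd pod. The field layout is an assumption about the JSON output.
    """
    members = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for entry in json.loads(line):
            status = entry["Status"]
            members.append(
                (entry["Endpoint"], status["header"]["revision"], status["dbSize"])
            )
    return members

def relative_spread(values):
    """Gap between the largest and smallest value, relative to the largest."""
    lowest, highest = min(values), max(values)
    return 0.0 if highest == 0 else (highest - lowest) / highest

def may_be_affected(members, threshold=0.10):
    """True if revision or dbSize differs across members by more than threshold."""
    revisions = [revision for _, revision, _ in members]
    db_sizes = [db_size for _, _, db_size in members]
    return relative_spread(revisions) > threshold or relative_spread(db_sizes) > threshold

# Hypothetical output from three etcd pods; the third member lags behind.
SAMPLE = "\n".join([
    '[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":1000},"dbSize":2000000}}]',
    '[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":1005},"dbSize":2010000}}]',
    '[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":700},"dbSize":1200000}}]',
])

flagged = may_be_affected(parse_endpoint_status(SAMPLE))
```

To use it on a real cluster, save the command output to a file and pass the file contents to parse_endpoint_status. A flagged result is only a hint that the cluster may be affected. Apply it only to clusters that are more than one hour old.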


Cause

TKG versions v1.5.0-1.5.3 use etcd versions v3.5.0-3.5.2, which have a known data inconsistency issue. The issue is fixed in TKG v1.5.4.

Resolution

Run the diagnosis under Symptoms above on all of your running clusters. If any of them appears to be affected:

  • If cluster contents are irreplaceable, contact VMware Support. Your data is intact and safe, but it may be at risk if you take any actions to restore the stopped control plane node.
  • Otherwise, rebuild the cluster.

Ensuring that control plane nodes have ample memory can help mitigate this issue.


Workaround:

VMware recommends:

  • If you're using TKG v1.4.x, your next upgrade should be to TKG v1.5.4.
  • If you're using TKG v1.5.1/1.5.2/1.5.3, upgrade to TKG v1.5.4.