CAPV controller not parsing datacenter correctly in multi-datacenter vSphere environment

Article ID: 331352

Updated On:

Products

VMware

Issue/Introduction

Symptoms:
This issue was observed on vSphere 7.0 U2 in an environment with multiple datacenters in the vCenter Server inventory. The CAPV (Cluster API Provider vSphere) controller shipped with TKG 1.4.0 (v0.7.11) fails to parse the datacenter setting in such a configuration.

Resolution

There are a couple of options to work around the issue.

Option 1

Revert to the previous version of the CAPV controller (v0.7.10 rather than v0.7.11). A quick way to check which CAPV version a cluster is currently running is shown after the note below.

Note: This has to be done in both:
  • The Kind/bootstrap cluster, which deploys the TKG management cluster to the vSphere IaaS,
  • The CAPV controller on the TKG management cluster, so that it can come online successfully. Since there is no CAPV component on workload clusters, no additional configuration is required to deploy workload clusters once the management cluster is up and running.
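
To confirm which CAPV image a given cluster is currently running, you can query the deployment image directly (a quick check, assuming the default capv-system namespace used by TKG):

% kubectl get deploy capv-controller-manager -n capv-system \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
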
Option 2

Run the command below on your bootstrap node to downgrade the CAPV controller manager to v0.7.10 for TKG 1.4.0:

sed -i 's/v0.7.11_vmware.1/v0.7.10_vmware.1/g' \
  ~/.config/tanzu/bom/tkg-bom-v1.4.0.yaml
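
To verify the substitution took effect, you can grep the BoM file for the new tag (a simple sanity check):

grep v0.7.10_vmware.1 ~/.config/tanzu/bom/tkg-bom-v1.4.0.yaml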


Once complete, re-run tanzu management-cluster create to complete the management cluster provisioning process.
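
For example, pointing the create command at your existing cluster configuration file (the path below is illustrative; use the configuration file generated for your environment):

% tanzu management-cluster create --file ~/.config/tanzu/tkg/clusterconfigs/mgmt-config.yaml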

Workaround:

Fixing CAPV in the bootstrap/Kind cluster

There is no editor in the bootstrap Kind node image used by TKG, so we need to install one before we can edit the CAPV controller deployment.


% docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3bcaff36587f projects.registry.vmware.com/tkg/kind/node:v1.21.2_vmware.1-v0.8.1 "/usr/local/bin/entr…" 6 seconds ago Up 1 second 127.0.0.1:60798->6443/tcp tkg-kind-c7g021vblarn8hbvdrs0-control-plane
 
% docker exec -it 3bcaff36587f bash
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# which vi
 
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# apt-get update
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# apt-get install vim -y
 
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# which vi
/usr/bin/vi



Now we can edit the CAPV deployment and change the image from v0.7.11 (which has the problem) to v0.7.10 (which does not):

# kubectl get deploy -A
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager 0/1 1 0 13s
capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager 0/1 1 0 9s
capi-system capi-controller-manager 0/1 1 0 16s
capi-webhook-system capi-controller-manager 0/1 1 0 18s
capi-webhook-system capi-kubeadm-bootstrap-controller-manager 0/1 1 0 15s
capi-webhook-system capi-kubeadm-control-plane-controller-manager 0/1 1 0 11s
capi-webhook-system capv-controller-manager 0/1 1 0 6s
capv-system capv-controller-manager 0/1 1 0 4s
cert-manager cert-manager 1/1 1 1 9m19s
cert-manager cert-manager-cainjector 1/1 1 1 9m19s
cert-manager cert-manager-webhook 1/1 1 1 9m18s
kube-system coredns 2/2 2 2 10m
local-path-storage local-path-provisioner 1/1 1 1 9m56s
 
# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager-74844b8c66-rqmn5 2/2 Running 0 7m32s
capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-857d687b9d-vs2gh 2/2 Running 0 7m28s
capi-system capi-controller-manager-778bd4dfb9-cm99j 2/2 Running 0 7m36s
capi-webhook-system capi-controller-manager-9995bdc94-5wvs8 2/2 Running 0 7m38s
capi-webhook-system capi-kubeadm-bootstrap-controller-manager-68845b65f8-hv7hk 2/2 Running 0 7m35s
capi-webhook-system capi-kubeadm-control-plane-controller-manager-9847c6747-plvmz 2/2 Running 0 7m31s
capi-webhook-system capv-controller-manager-7c5746b4cd-w6dmj 2/2 Running 0 7m26s
capv-system capv-controller-manager-6b84586c64-gm9s6 2/2 Running 0 7m24s
cert-manager cert-manager-5d667cf4fb-r5rdb 1/1 Running 0 16m
cert-manager cert-manager-cainjector-6967bfb7c5-swxpk 1/1 Running 0 16m
cert-manager cert-manager-webhook-6656fcc868-m4b8x 1/1 Running 0 16m
kube-system coredns-8dcb5c56b-cq667 1/1 Running 0 17m
kube-system coredns-8dcb5c56b-sl7c8 1/1 Running 0 17m
kube-system etcd-tkg-kind-c7g021vblarn8hbvdrs0-control-plane 1/1 Running 0 17m
kube-system kindnet-6kvj9 1/1 Running 0 17m
kube-system kube-apiserver-tkg-kind-c7g021vblarn8hbvdrs0-control-plane 1/1 Running 0 17m
kube-system kube-controller-manager-tkg-kind-c7g021vblarn8hbvdrs0-control-plane 1/1 Running 0 17m
kube-system kube-proxy-8vf2d 1/1 Running 0 17m
kube-system kube-scheduler-tkg-kind-c7g021vblarn8hbvdrs0-control-plane 1/1 Running 0 17m
local-path-storage local-path-provisioner-8b46957d4-96p8d 1/1 Running 0 17m


If you examine the controller logs, you will see that it has not parsed the datacenter setting from the TKG manifest, reporting "unable to find datacenter".
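
The manager container's logs can be tailed with kubectl, for example:

# kubectl logs deploy/capv-controller-manager -n capv-system -c manager --tail=20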

.
.
E0113 10:56:07.641569 1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="unexpected error while probing vcenter for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereCluster tkg-system/tkg141mgmt: unable to find datacenter \"\": default datacenter resolves to multiple instances, please specify" "controller"="vspherecluster" "name"="tkg141mgmt" "namespace"="tkg-system"
I0113 10:56:17.882883 1 vspherecluster_controller.go:315] capv-controller-manager/vspherecluster-controller/tkg-system/tkg141mgmt "msg"="Reconciling VSphereCluster"
E0113 10:56:19.362812 1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="unexpected error while probing vcenter for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereCluster tkg-system/tkg141mgmt: unable to find datacenter \"\": default datacenter resolves to multiple instances, please specify" "controller"="vspherecluster" "name"="tkg141mgmt" "namespace"="tkg-system"

.
.
When you edit the controller, change the version from v0.7.11 to v0.7.10.

root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# kubectl edit deploy capv-controller-manager -n capv-system
 
spec:
  containers:
  - args:
    - --secure-listen-address=0.0.0.0:8443
    - --upstream=http://127.0.0.1:8080/
    - --logtostderr=true
    - --v=10
    image: projects.registry.vmware.com/tkg/cluster-api/kube-rbac-proxy:v0.8.0_vmware.1
    imagePullPolicy: IfNotPresent
    name: kube-rbac-proxy
    ports:
    - containerPort: 8443
      name: https
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  - args:
    - --metrics-addr=127.0.0.1:8080
    env:
    - name: HTTP_PROXY
    - name: HTTPS_PROXY
    - name: NO_PROXY
    image: projects.registry.vmware.com/tkg/cluster-api/cluster-api-vsphere-controller:v0.7.11_vmware.1
    imagePullPolicy: IfNotPresent
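
If you prefer not to install an editor inside the Kind node, the same change can be made with a single command. This is a sketch, assuming the CAPV container is named manager, which matches the kubectl logs commands used later in this article:

# kubectl set image deployment/capv-controller-manager -n capv-system \
    manager=projects.registry.vmware.com/tkg/cluster-api/cluster-api-vsphere-controller:v0.7.10_vmware.1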



Once the deployment is saved, the original CAPV controller pod is automatically replaced with one running the older image. Now you should observe the datacenter being parsed correctly.
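
The rollout of the replacement pod can be followed with, for example:

# kubectl rollout status deployment/capv-controller-manager -n capv-system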

root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# kubectl logs capv-controller-manager-587fbf697f-nt7vg -n capv-system manager
.
.
I0113 10:59:25.847763 1 vspheremachine_controller.go:365] capv-controller-manager/vspheremachine-controller/tkg-system/tkg141mgmt-control-plane-mqgxj "msg"="waiting for ready state"
I0113 10:59:25.848275 1 controller.go:281] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="vspheremachine" "name"="tkg141mgmt-control-plane-mqgxj" "namespace"="tkg-system"
I0113 10:59:25.953197 1 session.go:63] capv-controller-manager/vspherevm-controller "msg"="found active cached vSphere client session" "datacenter"="/OCTO-Datacenter-A" "server"="vcsa-06.rainpole.com"

 

The Kind cluster will now be able to deploy virtual machines to the vSphere IaaS, allowing the management cluster to be created.


Fixing CAPV in the TKG management cluster

The same issue has to be addressed with the TKG management cluster, not just the bootstrap/Kind cluster.

% tanzu login
? Select a server tkg141mgmt ()
✔ successfully logged in to management cluster using the kubeconfig tkg141mgmt
 
% tanzu management-cluster get
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES
tkg141mgmt tkg-system createStalled 0/1 1/1 v1.21.2+vmware.1 management
 
Details:
 
NAME READY SEVERITY REASON SINCE MESSAGE
/tkg141mgmt False Info WaitingForControlPlane 89m
├─ClusterInfrastructure - VSphereCluster/tkg141mgmt
├─ControlPlane - KubeadmControlPlane/tkg141mgmt-control-plane
│ └─Machine/tkg141mgmt-control-plane-22rsf False Info WaitingForClusterInfrastructure 89m
└─Workers
└─MachineDeployment/tkg141mgmt-md-0
└─Machine/tkg141mgmt-md-0-7bfd765c9c-pcx8z False Info WaitingForClusterInfrastructure 89m
 
Providers:
 
NAMESPACE NAME TYPE PROVIDERNAME VERSION WATCHNAMESPACE
capi-kubeadm-bootstrap-system bootstrap-kubeadm BootstrapProvider kubeadm v0.3.23
capi-kubeadm-control-plane-system control-plane-kubeadm ControlPlaneProvider kubeadm v0.3.23
capi-system cluster-api CoreProvider cluster-api v0.3.23
capv-system infrastructure-vsphere InfrastructureProvider vsphere v0.7.11


 

Looking at the logs of the CAPV controller in the TKG management cluster, you see the same issue as before: it is not finding a datacenter in the vSphere inventory.
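
To inspect these logs, point kubectl at the management cluster first. The admin kubeconfig can be retrieved with the Tanzu CLI; the context name below follows the usual <cluster-name>-admin@<cluster-name> pattern:

% tanzu management-cluster kubeconfig get --admin
% kubectl config use-context tkg141mgmt-admin@tkg141mgmt
% kubectl logs deploy/capv-controller-manager -n capv-system -c manager --tail=20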

E0113 13:27:46.902732 1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="unexpected error while probing vcenter for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereCluster tkg-system/tkg141mgmt: unable to find datacenter \"\": default datacenter resolves to multiple instances, please specify" "controller"="vspherecluster" "name"="tkg141mgmt" "namespace"="tkg-system"
I0113 13:29:04.495248 1 vspherecluster_controller.go:315] capv-controller-manager/vspherecluster-controller/tkg-system/tkg141mgmt "msg"="Reconciling VSphereCluster"
E0113 13:29:04.711969 1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="unexpected error while probing vcenter for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereCluster tkg-system/tkg141mgmt: unable to find datacenter \"\": default datacenter resolves to multiple instances, please specify" "controller"="vspherecluster" "name"="tkg141mgmt" "namespace"="tkg-system"



The solution, as it was for the Kind cluster, is to edit the CAPV deployment and change the image to the older v0.7.10 version.
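
As before, the deployment can be opened for editing with kubectl, this time against the management cluster context:

% kubectl edit deploy capv-controller-manager -n capv-system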

- args:
  - --metrics-addr=127.0.0.1:8080
  env:
  - name: HTTP_PROXY
  - name: HTTPS_PROXY
  - name: NO_PROXY
  image: projects.registry.vmware.com/tkg/cluster-api/cluster-api-vsphere-controller:v0.7.10_vmware.1
  imagePullPolicy: IfNotPresent
  livenessProbe:
    failureThreshold: 3



This should now allow the TKG management cluster to come up successfully.

% tanzu management-cluster get
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES
tkg141mgmt tkg-system running 1/1 1/1 v1.21.2+vmware.1 management
 
Details:
 
NAME READY SEVERITY REASON SINCE MESSAGE
/tkg141mgmt True 37s
├─ClusterInfrastructure - VSphereCluster/tkg141mgmt True 43s
├─ControlPlane - KubeadmControlPlane/tkg141mgmt-control-plane True 37s
│ └─Machine/tkg141mgmt-control-plane-22rsf True 39s
└─Workers
└─MachineDeployment/tkg141mgmt-md-0
└─Machine/tkg141mgmt-md-0-7bfd765c9c-pcx8z True 2s
 
Providers:
 
NAMESPACE NAME TYPE PROVIDERNAME VERSION WATCHNAMESPACE
capi-kubeadm-bootstrap-system bootstrap-kubeadm BootstrapProvider kubeadm v0.3.23
capi-kubeadm-control-plane-system control-plane-kubeadm ControlPlaneProvider kubeadm v0.3.23
capi-system cluster-api CoreProvider cluster-api v0.3.23
capv-system infrastructure-vsphere InfrastructureProvider vsphere v0.7.11

 

Fixing a timed-out Pinniped

One of the issues observed with this CAPV bug is that the reconciliation of Pinniped (used for LDAP integration) can time out, because its LoadBalancer ingress does not come online in a timely fashion.
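
You can check whether the Pinniped Supervisor's LoadBalancer service has been assigned an external address (a quick check against the pinniped-supervisor namespace used by the job below):

% kubectl get svc -n pinniped-supervisor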

% kubectl get apps -A
NAMESPACE NAME DESCRIPTION SINCE-DEPLOY AGE
tkg-system ako-operator Reconcile succeeded 3m3s 25m
tkg-system antrea Reconcile succeeded 2m48s 25m
tkg-system metrics-server Reconcile succeeded 2m5s 25m
tkg-system pinniped Reconcile failed: Deploying: Error (see .status.usefulErrorMessage for details) 5m24s 25m
tkg-system tanzu-addons-manager Reconcile succeeded 2m33s 26m
tkg-system vsphere-cpi Reconcile succeeded 95s 25m
tkg-system vsphere-csi Reconcile succeeded 3m14s 25m
 
 
% tanzu package installed get pinniped -n tkg-system
/ Retrieving installation details for pinniped...
NAME: pinniped
PACKAGE-NAME: pinniped.tanzu.vmware.com
PACKAGE-VERSION:
STATUS: Reconcile failed: Error (see .status.usefulErrorMessage for details)
CONDITIONS: [{ReconcileFailed True Error (see .status.usefulErrorMessage for details)}]
USEFUL-ERROR-MESSAGE: kapp: Error: waiting on reconcile job/pinniped-post-deploy-job (batch/v1) namespace: pinniped-supervisor:
Finished unsuccessfully (Failed with reason BackoffLimitExceeded: Job has reached the specified backoff limit)

 

To get the apps output for Pinniped to show a successful reconcile, restart the pinniped-post-deploy-job by deleting it; it is recreated automatically, as shown below:

% kubectl delete jobs pinniped-post-deploy-job -n pinniped-supervisor
job.batch "pinniped-post-deploy-job" deleted

 
% kubectl get jobs -n pinniped-supervisor -w
NAME COMPLETIONS DURATION AGE
pinniped-post-deploy-job 0/1 0s
pinniped-post-deploy-job 0/1 0s
pinniped-post-deploy-job 0/1 0s 0s
pinniped-post-deploy-job 1/1 11s 11s
 
% kubectl get apps -A
NAMESPACE NAME DESCRIPTION SINCE-DEPLOY AGE
tkg-system ako-operator Reconcile succeeded 30s 125m
tkg-system antrea Reconcile succeeded 55s 125m
tkg-system load-balancer-and-ingress-service Reconcile succeeded 4m39s 32m
tkg-system metrics-server Reconcile succeeded 59s 125m
tkg-system pinniped Reconcile succeeded 40s 125m
tkg-system tanzu-addons-manager Reconcile succeeded 81s 126m
tkg-system vsphere-cpi Reconcile succeeded 18s 125m
tkg-system vsphere-csi Reconcile succeeded 104s 125m