PV attachment to a pod fails with the error "The resource 'volume' is in use"

Products

VMware

Issue/Introduction

Symptoms:
- Pods are stuck in the Init or ContainerCreating state.
- The output of kubectl describe pod shows the error "The resource 'volume' is in use.":
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedAttachVolume 43m attachdetach-controller AttachVolume.Attach failed for volume "pvc-8beebf4e-9a6b-49b4-85e5-7d9d5fd30d2a" : rpc error: code = Internal desc = failed to attach disk: "fafa5903-3a7b-4913-b90e-0ad329ee7d56" with node: "workload-md-1-84959668bd-w74qt" err failed to attach cns volume: "fafa5903-3a7b-4913-b90e-0ad329ee7d56" to node vm: "VirtualMachine:vm-21901 [VirtualCenterHost: ******, UUID: 423f3f41-cb37-facb-d920-98a197347ed8, Datacenter: ***** [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: *******]]". fault: "(*types.LocalizedMethodFault)(0xc000a70c60)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (*types.ResourceInUse)(0xc000ede640)({\n VimFault: (types.VimFault) {\n MethodFault: (types.MethodFault) {\n FaultCause: (*types.LocalizedMethodFault)(<nil>),\n FaultMessage: ([]types.LocalizableMessage) <nil>\n }\n },\n Type: (string) \"\",\n Name: (string) (len=6) \"volume\"\n }),\n LocalizedMessage: (string) (len=32) \"The resource 'volume' is in use.\"\n})\n". opId: "086a91d5"
Warning FailedMount 87s (x1769 over 2d18h) kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data]: timed out waiting for the condition
- The logs of the csi-attacher container (part of the csi-controller pod) show the error 'NoPermission':
I1111 04:45:24.730061 1 controller.go:165] Ignoring VolumeAttachment "csi-4954f3188868a22b82bc0ff468e87b9e91148480b59e273c33dad571ee27d279" change
I1111 04:45:24.730066 1 csi_handler.go:624] Saved detach error to "csi-4954f3188868a22b82bc0ff468e87b9e91148480b59e273c33dad571ee27d279"
I1111 04:45:24.730097 1 csi_handler.go:231] Error processing "csi-4954f3188868a22b82bc0ff468e87b9e91148480b59e273c33dad571ee27d279": failed to detach: rpc error: code = Internal desc = queryVolume failed for volumeID: "96848eeb-4f0d-496f-b290-bccc0b60d49c" with err=ServerFaultCode: NoPermission
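
The detach error that csi-attacher saves (see "Saved detach error" in the log above) is recorded on the VolumeAttachment object, so it can be listed without reading container logs. The following is a minimal sketch, assuming the input is the JSON produced by kubectl get volumeattachments -o json. The function name and the sample values (node and PV names, the shortened message) are illustrative and not taken from an affected cluster.

```python
import json


def find_stuck_attachments(volumeattachments_json):
    """Return VolumeAttachments that report an attach or detach error."""
    items = json.loads(volumeattachments_json).get("items", [])
    stuck = []
    for item in items:
        status = item.get("status") or {}
        # attachError / detachError are optional fields of the status.
        attach_err = (status.get("attachError") or {}).get("message")
        detach_err = (status.get("detachError") or {}).get("message")
        if attach_err or detach_err:
            stuck.append({
                "name": item["metadata"]["name"],
                "node": item["spec"]["nodeName"],
                "pv": item["spec"]["source"].get("persistentVolumeName"),
                "attach_error": attach_err,
                "detach_error": detach_err,
            })
    return stuck


if __name__ == "__main__":
    # Illustrative sample; in practice, pass the real kubectl output.
    sample = json.dumps({"items": [{
        "metadata": {"name": "csi-example"},
        "spec": {"nodeName": "worker-node-a",
                 "source": {"persistentVolumeName": "pvc-example"}},
        "status": {"attached": True,
                   "detachError": {"message": "ServerFaultCode: NoPermission"}},
    }]})
    for row in find_stuck_attachments(sample):
        print(row["name"], row["node"], row["detach_error"])
```

An entry whose detach_error contains NoPermission, for a node where the pod no longer runs, matches the symptom described in this article.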

Cause

- For a pod to use a PV, the virtual disk (vmdk) backing the PV must be attached to the worker node where the pod is running.
- When the pod stops, the vmdk is detached from that node.
- When the pod starts again (on the same node or a different one), the vmdk is attached to the node where the pod now runs.
- In this case, the TKG role granted to the TKG user lacks the vSphere permission required to detach the vmdk (the detach fails with NoPermission), so the vmdk remains attached to the source node and cannot be attached to the new node.
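
The sequence above can be illustrated with a toy model. This is only a sketch of the reasoning, not the vSphere or CSI API: ToyDiskManager, can_query_volume, and reschedule are hypothetical names, and the two exceptions simply mirror the fault names seen in the logs.

```python
class ResourceInUse(Exception):
    """Mirrors the fault "The resource 'volume' is in use."."""


class NoPermission(Exception):
    """Mirrors the NoPermission fault seen in the csi-attacher logs."""


class ToyDiskManager:
    """Toy stand-in for the storage layer: one disk, at most one node."""

    def __init__(self, can_query_volume):
        # Hypothetical flag: whether the role may query the volume.
        self.can_query_volume = can_query_volume
        self.attached_to = {}  # disk id -> node name

    def attach(self, disk, node):
        current = self.attached_to.get(disk)
        if current is not None and current != node:
            raise ResourceInUse(f"{disk} is still attached to {current}")
        self.attached_to[disk] = node

    def detach(self, disk):
        # In this model, detach must first query the volume.
        if not self.can_query_volume:
            raise NoPermission(f"queryVolume failed for {disk}")
        self.attached_to.pop(disk, None)


def reschedule(mgr, disk, new_node):
    """Move a pod's disk to new_node; return 'attached' or 'in use'."""
    try:
        mgr.detach(disk)
    except NoPermission:
        pass  # the detach error is recorded; the disk stays where it was
    try:
        mgr.attach(disk, new_node)
        return "attached"
    except ResourceInUse:
        return "in use"


mgr = ToyDiskManager(can_query_volume=False)
mgr.attach("disk-1", "node-a")
print(reschedule(mgr, "disk-1", "node-b"))  # prints "in use"
```

With can_query_volume=True the same call returns "attached", which corresponds to the behavior after the role is corrected.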

Resolution

Edit the TKG role to include the Cns.Searchable permission as mentioned in the "Required Permissions for the vSphere Account" section of the documentation:
https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/1.5/vmware-tanzu-kubernetes-grid-15/GUID-mgmt-clusters-vsphere.html
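
To confirm that the role was updated, the privilege IDs assigned to it can be compared against the required set. The following is a minimal sketch, assuming the privilege IDs have already been exported from vCenter as one ID per line (for example with a CLI such as govc, which is an assumption about your tooling). REQUIRED_PRIVILEGES contains only the permission named in this article; extend it with the full list from the linked documentation as needed. The function name and sample privilege lists are illustrative.

```python
# Only the permission named in this article; the linked documentation
# lists the complete set required for the vSphere account.
REQUIRED_PRIVILEGES = {"Cns.Searchable"}


def missing_privileges(role_privileges):
    """Return the required privilege IDs that the role does not include."""
    granted = {p.strip() for p in role_privileges if p.strip()}
    return sorted(REQUIRED_PRIVILEGES - granted)


if __name__ == "__main__":
    # Illustrative privilege lists, not taken from a real TKG role.
    before = ["System.Read", "System.View", "Datastore.FileManagement"]
    after = before + ["Cns.Searchable"]
    print(missing_privileges(before))  # ['Cns.Searchable']
    print(missing_privileges(after))   # []
```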

Additional Information

https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-AEB07597-F303-4FDD-87D9-0FDA4836E5BB.html

Impact/Risks:
Any pod that uses a PV that is still attached to the source node will not start. The output of kubectl describe pod for that pod shows the error "The resource 'volume' is in use."