K8S worker node hangs indefinitely during the PKS Upgrade

Article ID: 345551

Updated On:

Products

VMware

Issue/Introduction

Symptoms:
  • During the "Running errand Upgrade all clusters errand for Pivotal Container Service" step of the PKS upgrade process (all versions), the verbose output in Ops Manager shows the following for a long time (hours) with no progress:

    2019-03-01 14:41:50 UTC Running "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=10.193.90.11 --deployment=pivotal-container-service-8bc65453a0a0c8a92afe run-errand upgrade-all-service-instances"
    Using environment '10.193.90.11' as client 'ops_manager'
    Using deployment 'pivotal-container-service-8bc65453a0a0c8a92afe'

    Task 12565

    Task 12565 | 14:41:51 | Preparing deployment: Preparing deployment (00:00:01)
    Task 12565 | 14:41:52 | Running errand: pivotal-container-service/d5bfedbe-ae18-4f39-98f0-4b4c94550979

  • The bosh tasks output will show two tasks running. The first task is the parent upgrade-all-service-instances errand, and the second is the create deployment task for the service instance deployment of the PKS cluster.

    ubuntu@Ops-man-2-3-7:~$ bosh tasks
    Using environment '10.193.90.11' as client 'ops_manager'

    ID     State       Started At                    Last Activity At              User                                            Deployment                                             Description                                                                                              Result
    12566  processing  Fri Mar  1 14:41:54 UTC 2019  Fri Mar  1 14:41:54 UTC 2019  pivotal-container-service-8bc65453a0a0c8a92afe  service-instance_4b8ad40a-6c1a-4a22-9c3c-1330422ddb81  create deployment                                                                                        -
    12565  processing  Fri Mar  1 14:41:51 UTC 2019  Fri Mar  1 14:41:51 UTC 2019  ops_manager                                     pivotal-container-service-8bc65453a0a0c8a92afe         run errand upgrade-all-service-instances from deployment pivotal-container-service-8bc65453a0a0c8a92afe  -

    2 tasks

  • The BOSH task output will show the redeployment of the PKS cluster's service instance. In the output you will see that a worker node is hung at Updating instance worker.

    ubuntu@Ops-man-2-3-7:~$ bosh task 12566
    Using environment '10.193.90.11' as client 'ops_manager'

    Task 12566

    Task 12566 | 14:41:55 | Preparing deployment: Preparing deployment
    Task 12566 | 14:41:56 | Warning: DNS address not available for the link provider instance: pivotal-container-service/d5bfedbe-ae18-4f39-98f0-4b4c94550979
    Task 12566 | 14:41:57 | Warning: DNS address not available for the link provider instance: pivotal-container-service/d5bfedbe-ae18-4f39-98f0-4b4c94550979
    Task 12566 | 14:41:57 | Warning: DNS address not available for the link provider instance: pivotal-container-service/d5bfedbe-ae18-4f39-98f0-4b4c94550979
    Task 12566 | 14:42:08 | Preparing deployment: Preparing deployment (00:00:13)
    Task 12566 | 14:43:08 | Preparing package compilation: Finding packages to compile (00:00:00)
    Task 12566 | 14:43:08 | Updating instance master: master/71697842-061b-450a-ac0b-73f04012a22a (0) (canary) (00:01:20)
    Task 12566 | 14:44:28 | Updating instance master: master/4902c248-fd28-4129-8bca-5094c423fc73 (2) (00:01:09)
    Task 12566 | 14:45:37 | Updating instance master: master/de2fda96-f396-4f8d-8bcb-c40306d4d88e (1) (00:01:18)
    Task 12566 | 14:46:55 | Updating instance worker: worker/dfa10f94-e690-4249-8463-dc7d9fc3efe6 (0) (canary) (00:01:25)
    Task 12566 | 14:48:20 | Updating instance worker: worker/39449084-e393-4fc4-a7b5-1ba613227012 (3)

  • If you bosh ssh to the worker node and check the kubelet drain logs, you can confirm that the drain script is unable to evict a pod from the worker node (a kubectl check for the pods still scheduled on the node is sketched after this list).

    bosh -d service-instance_4b8ad40a-6c1a-4a22-9c3c-1330422ddb81 ssh worker/39449084-e393-4fc4-a7b5-1ba613227012

    worker/39449084-e393-4fc4-a7b5-1ba613227012:~$ sudo -i
    worker/39449084-e393-4fc4-a7b5-1ba613227012:~# ps -ef | grep drain
    root     18428   724  0 14:48 ?        00:00:00 bash /var/vcap/jobs/kubelet/bin/drain job_changed hash_changed docker
    root     18476 18428  0 14:48 ?        00:00:00 kubectl --kubeconfig /var/vcap/jobs/kubelet/config/kubeconfig-drain drain -l bosh.id=39449084-e393-4fc4-a7b5-1ba613227012 --grace-period 10 --force --delete-local-data --ignore-daemonsets


    worker/39449084-e393-4fc4-a7b5-1ba613227012:/var/vcap/sys/log/kubelet# ls -l drain.stderr.log
    -rw-r--r-- 1 root root 65019 Mar  1 15:24 drain.stderr.log
    worker/39449084-e393-4fc4-a7b5-1ba613227012:/var/vcap/sys/log/kubelet# tail -f drain.stderr.log
    error when evicting pod "nginx-9cbcd98fd-lb7hj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    error when evicting pod "nginx-9cbcd98fd-lb7hj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    error when evicting pod "nginx-9cbcd98fd-lb7hj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget

  • You can also use --debug to get more information about the task that is running:
    bosh task <task number> --debug | grep INFO
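
If you have kubectl access to the cluster, you can also list the pods that are still scheduled on the draining node. This is a sketch; the <node-name> placeholder below must be replaced with the node name reported by kubectl get nodes:

    # List all pods still running on the node being drained (name from `kubectl get nodes`).
    # Pods that are not DaemonSet-managed and have not yet been evicted are the candidates
    # blocking the drain.
    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>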

If the details of the offending pod (workload) are not clear from drain.stderr.log, the following steps can be run to replicate the drain command and retrieve more information:

  1. The following command will highlight the IPs of the Kubernetes nodes:
    kubectl get nodes -o wide
    NAME                                   STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
    6d038635-17b5-419b-b94c-f0a72525c66b   Ready    <none>   4d    v1.12.4   10.193.90.98   10.193.90.98   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1
    77c0e5c7-546b-46a7-8e47-8908687980f5   Ready    <none>   4d    v1.12.4   10.193.90.95   10.193.90.95   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1
    7e3ecf57-f2ff-4f69-b503-346cc5c93cea   Ready    <none>   4d    v1.12.4   10.193.90.97   10.193.90.97   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1
    9a107784-4ed5-4557-b6e3-4b43515341b5   Ready    <none>   4d    v1.12.4   10.193.90.96   10.193.90.96   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1
    d5bfc2b0-223f-4cdd-a173-f1acba6fd07a   Ready    <none>   4d    v1.12.4   10.193.90.94   10.193.90.94   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1

  2. List the deployment VMs and find the IP of the hung worker instance; the Kubernetes node whose INTERNAL-IP matches it is the hung node:
    bosh -d service-instance_4b8ad40a-6c1a-4a22-9c3c-1330422ddb81 vms

    Deployment 'service-instance_4b8ad40a-6c1a-4a22-9c3c-1330422ddb81'

    Instance                                     Process State  AZ   IPs           VM CID                                   VM Type      Active
    master/4902c248-fd28-4129-8bca-5094c423fc73  running        az1  10.193.90.92  vm-07eb1792-ec2d-44c6-a3a1-ee2c1a98f514  medium.disk  true
    master/71697842-061b-450a-ac0b-73f04012a22a  running        az1  10.193.90.91  vm-cd2918ea-8701-4d2a-87d8-2170f31cf144  medium.disk  true
    master/de2fda96-f396-4f8d-8bcb-c40306d4d88e  running        az1  10.193.90.93  vm-2cdb8541-61c0-4c80-8ae4-97251a1a98fc  medium.disk  true
    worker/39449084-e393-4fc4-a7b5-1ba613227012  running        az1  10.193.90.95  vm-cb68be3e-cf1f-406e-b786-ad2f31f67937  medium.disk  true
    worker/622db2c3-3c01-4ddd-84a3-9e702dc34e54  running        az1  10.193.90.96  vm-fc6b2f3b-b9fc-42f8-be3d-306a0029aa55  medium.disk  true
    worker/b026c929-6054-477d-a049-de24ecca0d76  running        az1  10.193.90.97  vm-1ca53832-54bf-4a7e-ac27-20f64ebb3be1  medium.disk  true
    worker/cae662e4-6380-4126-a981-b1f0e5837952  running        az1  10.193.90.98  vm-1771a9ab-3104-4c4b-b04c-033a8b6ada42  medium.disk  true
    worker/dfa10f94-e690-4249-8463-dc7d9fc3efe6  running        az1  10.193.90.94  vm-31ef83bd-0ea7-476c-8911-659dd3c584ce  medium.disk  true

  3. Run the drain command directly with kubectl against the matching node name:
    kubectl drain 77c0e5c7-546b-46a7-8e47-8908687980f5  --grace-period 10 --force --delete-local-data --ignore-daemonsets

    node/77c0e5c7-546b-46a7-8e47-8908687980f5 already cordoned
    WARNING: Ignoring DaemonSet-managed pods: fluent-bit-mbvtg
    error when evicting pod "nginx-9cbcd98fd-lb7hj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

  4. The output confirms that the pod nginx-9cbcd98fd-lb7hj cannot be evicted because doing so would violate the pod's disruption budget. The commands sketched below can help identify which PodDisruptionBudget covers the pod.
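
Once the blocking pod is identified, the following kubectl commands can help locate the PodDisruptionBudget that covers it. This is a sketch; the pod name comes from the output above, while <pdb-name> and <namespace> are placeholders you must substitute from your own cluster:

    # Show the pod's namespace and labels; the PDB selector matches on these labels.
    kubectl get pod nginx-9cbcd98fd-lb7hj -o wide --show-labels

    # List all PodDisruptionBudgets; an ALLOWED DISRUPTIONS value of 0 indicates a PDB
    # that will block eviction.
    kubectl get pdb --all-namespaces

    # Inspect the suspect PDB to see its selector and its minimum-available setting.
    kubectl describe pdb <pdb-name> -n <namespace>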



Environment

VMware PKS 1.x

Cause

During the PKS tile upgrade process, worker nodes are cordoned and drained. The drain depends on Kubernetes being able to evict every pod from the node; if Kubernetes cannot evict a pod, the drain hangs indefinitely.
One reason Kubernetes may be unable to evict a pod is that a PodDisruptionBudget object has been configured in a way that allows 0 disruptions, for example because only a single replica of the pod is scheduled and the budget requires at least one replica to remain available.

An Application Owner can create a PodDisruptionBudget (PDB) object for each application. A PDB limits the number of pods of a replicated application that can be down simultaneously due to voluntary disruptions. This is a known issue: a Kubernetes PDB can conflict with a PKS upgrade and prevent the kubelet job from being drained. A minimal example of the conflicting configuration is sketched below.
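
As a minimal sketch of the conflicting configuration described above (the names nginx and nginx-pdb and the image are illustrative, not taken from this article): a single-replica workload combined with a PDB that requires one pod to remain available leaves the budget with 0 allowed disruptions, so every eviction attempt during the drain is refused.

    # Illustrative single-replica Deployment (kubectl create deployment defaults to one
    # replica and labels the pods app=nginx).
    kubectl create deployment nginx --image=nginx

    # A PDB that requires at least one pod with label app=nginx to stay available.
    kubectl create poddisruptionbudget nginx-pdb --selector=app=nginx --min-available=1

    # With one replica and min-available=1, ALLOWED DISRUPTIONS is 0, which is what
    # blocks the node drain.
    kubectl get pdb nginx-pdb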

Resolution

First, see whether the PDB can be changed or even deleted to allow the upgrade to continue. If this does not resolve the issue, the following workaround is also possible (example commands are sketched below):

  1. Configure .spec.replicas for the workload to be greater than the minimum required by the PodDisruptionBudget object. When the number of replicas configured in .spec.replicas is greater than the number of replicas the PodDisruptionBudget requires to remain available, voluntary disruptions such as a node drain can proceed.

For more information, see How Disruption Budgets Work in the Kubernetes documentation. For more information about workload capacity and uptime requirements in PKS, see Prepare to Upgrade in Upgrading PKS.
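
A hedged example of both approaches, assuming the blocking pod belongs to a Deployment named nginx covered by a PDB named nginx-pdb in the default namespace (these names are illustrative):

    # Inspect the PDBs in the cluster; ALLOWED DISRUPTIONS of 0 marks the blocker.
    kubectl get pdb --all-namespaces

    # Option A: raise the replica count above the PDB's minimum so the drain can evict one pod.
    kubectl scale deployment nginx --replicas=2

    # Option B: delete (or edit) the PDB for the duration of the upgrade and recreate it afterwards.
    kubectl delete pdb nginx-pdb

Because the drain retries the eviction every 5 seconds (as seen in drain.stderr.log), the hung BOSH task should resume on its own once the eviction is allowed.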