VMware PKS cluster creation fails with pods stuck in "ContainerCreating" state

Article ID: 316824

Products

VMware

Issue/Introduction

Symptoms:
  • You are unable to create a PKS cluster because some of the pods are stuck in the ContainerCreating state.

  • When you run kubectl get pods -o wide --all-namespaces in the Kubernetes cluster, you see output similar to:

NAMESPACE     NAME                              READY   STATUS              RESTARTS   AGE   IP           NODE                                   NOMINATED NODE
kube-system   heapster-85647cf566-kjjfc         0/1     ContainerCreating   0          52m   <none>       41310f54-4dd0-4a3d-983c-092d4c02f956   <none>
kube-system   kube-dns-7559c96fc4-76nqt         3/3     Running             0          52m   172.25.4.2   875b1e03-d2dc-429e-959d-fe05df62542b   <none>
kube-system   metrics-server-555d98886f-vtxzr   1/1     Running             0          52m   172.25.4.3   0a1cb163-5e90-41c6-9001-54e754c5064a   <none>
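
To see why a specific pod is stuck, describe it; the pod name below is taken from the sample output above:

kubectl describe pod heapster-85647cf566-kjjfc -n kube-system

The Events section at the end of the output typically shows the sandbox-creation errors reported by kubelet.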

  • When you log in to the worker node where pod creation is failing and check the hyperbus status, it is reported as unhealthy:

sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli -c get node-agent-hyperbus status
HyperBus status: Unhealthy
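
For comparison, the same command on a healthy worker node reports:

HyperBus status: Healthy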

  • When you log in to the ESXi host where this worker node is running and check the hyperbus status, it shows either COMMUNICATION_ERROR or miss_version_handshake.

Connect to the ESXi host as root, start nsxcli, and then run the following command:
get hyperbus connection info
VIFID                                    connection                status
<VIF ID>                                 169.254.1.35:2345         COMMUNICATION_ERROR
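
On a healthy host, the same command lists each VIF with a status of HEALTHY, for example (the VIF ID shown is a placeholder):

get hyperbus connection info
VIFID                                    connection                status
<VIF ID>                                 169.254.1.35:2345         HEALTHY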

  • In ncp.stdout.log on the primary node (see below for its typical location), you see entries similar to:

2019-01-18T01:28:10.185Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] nsx_ujo.ncp.coe_adaptor Finding the configured NCP adaptor kubernetes
1 2019-01-18T01:28:10.185Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="WARNING"] vmware_nsxlib.v3.resources Deprecated: resources.LogicalRouter is deprecated. Please use core_resources.NsxLibLogicalRouter instead.
1 2019-01-18T01:28:10.185Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] nsx_ujo.ncp.coe_adaptor Finding the configured NCP adaptor kubernetes
1 2019-01-18T01:28:10.244Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] cli.server.container_cli_server Starting ncp CLI server
1 2019-01-18T01:28:10.245Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] cli.server.container_cli_server Creating control channel for ncp CLI server
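
NCP runs as a BOSH job, so ncp.stdout.log is typically located under the BOSH log directory on the primary node (the exact path can vary by PKS version):

tail -f /var/vcap/sys/log/ncp/ncp.stdout.log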

  • In kubelet.stderr.log on the worker node (see below for its typical location), you see entries similar to:

2019-01-18T05:32:20.688Z c6f24c69-1296-4205-b811-dd7262f05b2a NSX 16953 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_cni" level="ERROR" errorCode="NCP04004"] __main__ Failed to receive message header from nsx_node_agent
E0118 05:32:20.692163   25550 cni.go:260] Error adding network: Failed to receive message header from nsx_node_agent
E0118 05:32:20.692187   25550 cni.go:228] Error while adding to cni network: Failed to receive message header from nsx_node_agent
W0118 05:32:20.692680   25550 cni.go:243] CNI failed to retrieve network namespace path: Error: No such container: 561aaa5d6534394c17bbab600576328bf757bf236a5504fe59fd360c3876de82
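
Likewise, kubelet runs as a BOSH job on the worker node, so its stderr log is typically found at (again, the exact path can vary by PKS version):

tail -f /var/vcap/sys/log/kubelet/kubelet.stderr.log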


Environment

VMware PKS 1.x

Cause

This issue occurs because of a hyperbus communication failure between the worker node and the ESXi host on which it runs. The hyperbus channel is used by the NSX node agent to receive container networking configuration, so while it is down the CNI plugin cannot wire up new pods and they remain in the ContainerCreating state.

Resolution

To resolve this issue:
  1. Log in to the ESXi host as root and verify that the hyperbus adapter (vmk50) exists by running this command:
    esxcfg-vmknic -l
  2. Enter the NSX CLI by running the command nsxcli.
  3. Verify the hyperbus connection info by running the command:
    get hyperbus connection info
  4. If the hyperbus status shows COMMUNICATION_ERROR or miss_version_handshake, restart the netcpad service, then verify the recovery as described below:
    /etc/init.d/netcpad restart
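
After restarting netcpad, confirm that hyperbus communication has recovered and that the stuck pods start. A minimal check sequence, reusing the commands from the Symptoms section:

# On the ESXi host, inside nsxcli: the VIF status should now show HEALTHY
get hyperbus connection info

# On the worker node: the agent should now report the hyperbus as Healthy
sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli -c get node-agent-hyperbus status

# In the Kubernetes cluster: affected pods should move from ContainerCreating to Running
kubectl get pods -o wide --all-namespaces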