VMware PKS cluster creation fails with pods stuck in "ContainerCreating" state

Article ID: 316824

Products

VMware

Issue/Introduction

Symptoms:
  • You are unable to create a PKS cluster because some of the pods are stuck in the ContainerCreating state.

  • When you run kubectl get pods -o wide --all-namespaces in the Kubernetes cluster, you see output similar to:

NAMESPACE     NAME                              READY   STATUS              RESTARTS   AGE   IP           NODE                                   NOMINATED NODE
kube-system   heapster-85647cf566-kjjfc         0/1     ContainerCreating   0          52m   <none>       41310f54-4dd0-4a3d-983c-092d4c02f956   <none>
kube-system   kube-dns-7559c96fc4-76nqt         3/3     Running             0          52m   172.25.4.2   875b1e03-d2dc-429e-959d-fe05df62542b   <none>
kube-system   metrics-server-555d98886f-vtxzr   1/1     Running             0          52m   172.25.4.3   0a1cb163-5e90-41c6-9001-54e754c5064a   <none>
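
To see why a specific pod is stuck, describe it; the pod name below is taken from the sample output above:

kubectl describe pod heapster-85647cf566-kjjfc -n kube-system

The Events section at the end of the output typically shows the sandbox-creation errors reported by kubelet.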

  • When you log in to the worker node where pod creation is failing and check the hyperbus status, it is reported as unhealthy:

sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli -c get node-agent-hyperbus status
HyperBus status: Unhealthy
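
For comparison, the same command on a healthy worker node reports:

HyperBus status: Healthy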

  • When you log in to the ESXi host where this worker node is running and check the hyperbus status, it shows either COMMUNICATION_ERROR or miss_version_handshake.

Connect to the ESXi host as root, start nsxcli, and then run the following command:
get hyperbus connection info
VIFID                                    connection                status
<VIF ID>                                 169.254.1.35:2345         COMMUNICATION_ERROR
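
On a healthy host, the same command lists each VIF with a status of HEALTHY, for example (the VIF ID shown is a placeholder):

get hyperbus connection info
VIFID                                    connection                status
<VIF ID>                                 169.254.1.35:2345         HEALTHY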

  • In ncp.stdout.log on the primary node (see below for its typical location), you see entries similar to:

2019-01-18T01:28:10.185Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] nsx_ujo.ncp.coe_adaptor Finding the configured NCP adaptor kubernetes
1 2019-01-18T01:28:10.185Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="WARNING"] vmware_nsxlib.v3.resources Deprecated: resources.LogicalRouter is deprecated. Please use core_resources.NsxLibLogicalRouter instead.
1 2019-01-18T01:28:10.185Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] nsx_ujo.ncp.coe_adaptor Finding the configured NCP adaptor kubernetes
1 2019-01-18T01:28:10.244Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] cli.server.container_cli_server Starting ncp CLI server
1 2019-01-18T01:28:10.245Z a30b596b-5a44-49b7-ba45-91abf6800543 NSX 19425 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] cli.server.container_cli_server Creating control channel for ncp CLI server
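
NCP runs as a BOSH job, so ncp.stdout.log is typically located under the BOSH log directory on the primary node (the exact path can vary by PKS version):

tail -f /var/vcap/sys/log/ncp/ncp.stdout.log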

  • In kubelet.stderr.log on the worker node (see below for its typical location), you see entries similar to:

2019-01-18T05:32:20.688Z c6f24c69-1296-4205-b811-dd7262f05b2a NSX 16953 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_cni" level="ERROR" errorCode="NCP04004"] __main__ Failed to receive message header from nsx_node_agent
E0118 05:32:20.692163   25550 cni.go:260] Error adding network: Failed to receive message header from nsx_node_agent
E0118 05:32:20.692187   25550 cni.go:228] Error while adding to cni network: Failed to receive message header from nsx_node_agent
W0118 05:32:20.692680   25550 cni.go:243] CNI failed to retrieve network namespace path: Error: No such container: 561aaa5d6534394c17bbab600576328bf757bf236a5504fe59fd360c3876de82
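
Likewise, kubelet runs as a BOSH job on the worker node, so its stderr log is typically found at (again, the exact path can vary by PKS version):

tail -f /var/vcap/sys/log/kubelet/kubelet.stderr.log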


Environment

VMware PKS 1.x

Cause

This issue occurs because of a hyperbus communication failure between the worker node and the ESXi host on which it runs. The hyperbus channel is used by the NSX node agent to receive container networking configuration, so while it is down the CNI plugin cannot wire up new pods and they remain in the ContainerCreating state.

Resolution

To resolve this issue:
  1. Log in to the ESXi host as root and verify that the hyperbus adapter (vmk50) exists by running this command:
    esxcfg-vmknic -l
  2. Enter the NSX CLI by running the command nsxcli.
  3. Verify the hyperbus connection info by running the command:
    get hyperbus connection info
  4. If the hyperbus status shows COMMUNICATION_ERROR or miss_version_handshake, restart the netcpad service, then verify the recovery as described below:
    /etc/init.d/netcpad restart
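
After restarting netcpad, confirm that hyperbus communication has recovered and that the stuck pods start. A minimal check sequence, reusing the commands from the Symptoms section:

# On the ESXi host, inside nsxcli: the VIF status should now show HEALTHY
get hyperbus connection info

# On the worker node: the agent should now report the hyperbus as Healthy
sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli -c get node-agent-hyperbus status

# In the Kubernetes cluster: affected pods should move from ContainerCreating to Running
kubectl get pods -o wide --all-namespaces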