An Enterprise PKS cluster is in a failed state when the ncp process is failed on a primary node

Article ID: 319528

Updated On:

Products

VMware

Issue/Introduction

Symptoms:
  • You see output similar to the following when running the monit summary command on a primary node: 

Process 'kube-apiserver'            running
Process 'kube-controller-manager'   running
Process 'kube-scheduler'            running
Process 'etcd'                      running
Process 'blackbox'                  running
Process 'ncp'                       Does not exist

  • You see messages similar to the following in the /var/vcap/sys/log/ncp/ncp.stdout.log file:

2019-04-26T11:08:53.990Z c431a4c7-992b-4d9d-815c-375e3984a24b NSX 8550 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="ERROR" errorCode="NCP00007"] nsx_ujo.common.utils NSX configuration error: [u'IP block 16fc3f03-ae64-4c7e-8031-cd560a071184 overlaps with IP space 16fc3f03-ae64-4c7e-8031-cd560a071184']
2019-04-26T11:08:53.990Z c431a4c7-992b-4d9d-815c-375e3984a24b NSX 8550 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="CRITICAL" security="True" errorCode="NCP00001"] nsx_ujo.ncp.main NSX configuration validation failed


Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
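To pull the affected IP block UUID out of the log instead of reading it by eye, a minimal sketch such as the following may help. The function name ncp_overlap_block_uuid is hypothetical; the log path and the NCP00007 error code are the ones shown above, and the pattern assumes the message has the same shape as the example.

```shell
# Hypothetical helper: print the UUID of each IP block named in an
# NCP00007 "overlaps with IP space" error.
# $1 is the NCP log file (defaults to the path shown above).
ncp_overlap_block_uuid() {
  log_file="${1:-/var/vcap/sys/log/ncp/ncp.stdout.log}"
  grep 'errorCode="NCP00007"' "$log_file" \
    | sed -n 's/.*IP block \([0-9a-f-]\{36\}\) overlaps.*/\1/p' \
    | sort -u
}

# Usage on the primary node:
#   ncp_overlap_block_uuid
```

If the function prints nothing, the log does not contain an overlap error in this form, and the cause described in this article may not apply.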


Environment

VMware PKS 1.x

Cause

This error is generated when an owned IpBlock (a container IP block) conflicts with an external IP block (used for routable pods).

You can validate this by checking the ncp.ini file on the primary node:

cat /var/vcap/jobs/ncp/config/ncp.ini 

Note: You will see output similar to the following:

tier0_router = 16963613-6961-4c68-987f-3bc5b9a5725f
lb_service = lb-pks-d80ed3f2-44a3-4e9c-a1ee-ae477092a6e4
no_snat_ip_blocks = 16fc3f03-ae64-4c7e-8031-cd560a071184
external_ip_pools = bbb1813b-5ee2-4c4d-a42f-c2e303fa7dca
election_profile = election-profile-pks-d80ed3f2-44a3-4e9c-a1ee-ae477092a6e4


In ncp.ini there is no entry for container IP blocks, and the IP block named in the error message from the Symptoms section (UUID 16fc3f03-ae64-4c7e-8031-cd560a071184) is configured as a no-SNAT block under no_snat_ip_blocks, which means it is part of the external_ip_blocks.
NCP, however, is treating this IP block as an owned IpBlock.
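The comparison between the UUID in the error message and the no_snat_ip_blocks entry can also be scripted. The following is a sketch only: the function name block_is_no_snat is hypothetical, and it assumes the key = value layout shown in the sample output above.

```shell
# Hypothetical helper: succeed (exit 0) if the IP block UUID in $1 is
# listed on the no_snat_ip_blocks line of the ncp.ini file in $2
# (defaults to the path shown above).
block_is_no_snat() {
  uuid="$1"
  ini_file="${2:-/var/vcap/jobs/ncp/config/ncp.ini}"
  grep -E '^[[:space:]]*no_snat_ip_blocks[[:space:]]*=' "$ini_file" \
    | grep -qF -- "$uuid"
}

# Usage on the primary node, with the UUID from the error message:
#   block_is_no_snat 16fc3f03-ae64-4c7e-8031-cd560a071184 && echo "listed as no-SNAT block"
```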

Resolution

In NSX-T Manager, go to Networking --> IPAM --> IP Blocks --> Overview --> Tags and verify whether the IP block with the UUID from the error message has a cluster tag on it.

If it does, remove the tag in NSX-T Manager and restart NCP on the primary node by issuing the following commands:

monit stop ncp
monit start ncp
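
After the restart, the ncp entry in the monit summary output can be checked again. The sketch below filters that output down to the ncp status; the function name ncp_monit_status is hypothetical, and it assumes the output format shown in the Symptoms section.

```shell
# Hypothetical helper: read `monit summary` output on stdin and print
# only the status text reported for the ncp process.
ncp_monit_status() {
  sed -n "s/^Process 'ncp'[[:space:]]*//p"
}

# Usage on the primary node:
#   monit summary | ncp_monit_status
```

If the status still reads "Does not exist", check ncp.stdout.log again for the errors shown in the Symptoms section.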