
Cluster information:

Kubernetes version: v1.28.2
Cloud being used: Virtualbox
Installation method: Kubernetes Cluster VirtualBox
Host OS: Ubuntu 22.04.3 LTS
CNI and version: calico
CRI and version: containerd://1.7.2

The cluster consists of 1 master node and 2 worker nodes. For the first moment after startup (a matter of 1-2 minutes), everything looks good:

lab@master:~$ kubectl -nkube-system get po -o wide
NAME                                       READY   STATUS             RESTARTS          AGE     IP              NODE       NOMINATED NODE   READINESS GATES
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running            12 (2m11s ago)    13d     10.10.219.98    master     <none>           <none>
calico-node-bqlnm                          1/1     Running            3 (2m11s ago)     4d2h    192.168.1.164   master     <none>           <none>
calico-node-mrd86                          1/1     Running            105 (2d20h ago)   4d2h    192.168.1.165   worker01   <none>           <none>
calico-node-r6w9s                          1/1     Running            110 (2d20h ago)   4d2h    192.168.1.166   worker02   <none>           <none>
coredns-5dd5756b68-njtpf                   1/1     Running            11 (2m11s ago)    13d     10.10.219.100   master     <none>           <none>
coredns-5dd5756b68-pxn8l                   1/1     Running            10 (2m11s ago)    13d     10.10.219.99    master     <none>           <none>
etcd-master                                1/1     Running            67 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-apiserver-master                      1/1     Running            43 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-controller-manager-master             1/1     Running            47 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-proxy-ffnzb                           1/1     Running            122 (95s ago)     12d     192.168.1.165   worker01   <none>           <none>
kube-proxy-hf4mx                           1/1     Running            108 (78s ago)     12d     192.168.1.166   worker02   <none>           <none>
kube-proxy-ql576                           1/1     Running            15 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-scheduler-master                      1/1     Running            46 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
metrics-server-54cb77cffd-q292x            0/1     CrashLoopBackOff   68 (18s ago)      3d21h   10.10.30.94     worker02   <none>           <none>

However, a few minutes later, pods in the kube-system namespace start flapping/crashing:

lab@master:~$ kubectl -nkube-system get po
NAME                                       READY   STATUS             RESTARTS          AGE
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running            12 (19m ago)      13d
calico-node-bqlnm                          0/1     Running            3 (19m ago)       4d2h
calico-node-mrd86                          0/1     CrashLoopBackOff   111 (2m28s ago)   4d2h
calico-node-r6w9s                          0/1     CrashLoopBackOff   116 (2m15s ago)   4d2h
coredns-5dd5756b68-njtpf                   1/1     Running            11 (19m ago)      13d
coredns-5dd5756b68-pxn8l                   1/1     Running            10 (19m ago)      13d
etcd-master                                1/1     Running            67 (19m ago)      13d
kube-apiserver-master                      1/1     Running            43 (19m ago)      13d
kube-controller-manager-master             1/1     Running            47 (19m ago)      13d
kube-proxy-ffnzb                           0/1     CrashLoopBackOff   127 (42s ago)     12d
kube-proxy-hf4mx                           0/1     CrashLoopBackOff   113 (2m17s ago)   12d
kube-proxy-ql576                           1/1     Running            15 (19m ago)      13d
kube-scheduler-master                      1/1     Running            46 (19m ago)      13d
metrics-server-54cb77cffd-q292x            0/1     CrashLoopBackOff   73 (64s ago)      3d22h

It is completely unclear to me what is wrong. Checking the pod description, I see repeating events:

lab@master:~$ kubectl -nkube-system describe po kube-proxy-ffnzb
.
.
.
Events:
  Type     Reason          Age                      From     Message
  ----     ------          ----                     ----     -------
  Normal   Killing         2d20h (x50 over 3d1h)    kubelet  Stopping container kube-proxy
  Warning  BackOff         2d20h (x1146 over 3d1h)  kubelet  Back-off restarting failed container kube-proxy in pod kube-proxy-ffnzb_kube-system(79f808ba-f450-4103-80a9-0e75af2e77cf)
  Normal   Pulled          8m11s (x3 over 10m)      kubelet  Container image "registry.k8s.io/kube-proxy:v1.28.6" already present on machine
  Normal   Created         8m10s (x3 over 10m)      kubelet  Created container kube-proxy
  Normal   Started         8m10s (x3 over 10m)      kubelet  Started container kube-proxy
  Normal   SandboxChanged  6m56s (x4 over 10m)      kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Killing         4m41s (x4 over 10m)      kubelet  Stopping container kube-proxy
  Warning  BackOff         12s (x28 over 10m)       kubelet  Back-off restarting failed container kube-proxy in pod kube-proxy-ffnzb_kube-system(79f808ba-f450-4103-80a9-0e75af2e77cf)

Note: this situation does not prevent me from deploying example workloads (nginx), which seem to run stably. However, when I tried to add metrics-server, it keeps crashing as well (possibly related to the CrashLoopBackOff pods in the kube-system namespace).
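For completeness, the metrics-server logs (current and previous crashed run) can be fetched with the commands below; the deployment name metrics-server is inferred from the pod name above:

    kubectl -nkube-system logs deploy/metrics-server
    kubectl -nkube-system logs metrics-server-54cb77cffd-q292x --previous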

Any ideas what might be wrong/where else to look to troubleshoot?

1 Answer


I was given a hint to check SystemdCgroup in the containerd config file, following this link.

In my case it turned out that /etc/containerd/config.toml was missing on the master node.

  • To generate it:
    sudo containerd config default | sudo tee /etc/containerd/config.toml
    
  • Next, set SystemdCgroup = true in /etc/containerd/config.toml (see the sketch below for where the flag lives).
  • Restart the containerd service:
    sudo systemctl restart containerd
    
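For reference, a minimal sketch of where the flag lives in a containerd 1.7 default config, plus one way to flip it in place and confirm what containerd actually loaded (the sed pattern assumes the stock generated file, where the flag defaults to false):

    # relevant section of /etc/containerd/config.toml (containerd 1.7.x layout)
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = true

    # flip the flag in place and restart the runtime
    sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
    sudo systemctl restart containerd

    # verify the value containerd is actually running with
    sudo containerd config dump | grep -i SystemdCgroup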

That, however, put my cluster in the following state:

lab@master:~$ kubectl -nkube-system get po
The connection to the server master:6443 was refused - did you specify the right host or port?
lab@master:~$ kubectl get nodes
The connection to the server master:6443 was refused - did you specify the right host or port?

I reverted it back to false on the master and restarted containerd. On the worker nodes, however, I kept it as true.
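A useful cross-check here is the cgroup driver each node's kubelet is configured with, since containerd's SystemdCgroup setting is generally expected to match it. On kubeadm-managed nodes this can be inspected as shown below (the file path is the kubeadm default and may differ on other setups):

    grep cgroupDriver /var/lib/kubelet/config.yaml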

That fixed the problem:

lab@master:~$ kubectl -nkube-system get po -o wide
NAME                                       READY   STATUS    RESTARTS       AGE    IP              NODE       NOMINATED NODE   READINESS GATES
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running   8 (18m ago)    14d    10.10.219.86    master     <none>           <none>
calico-node-c4rxp                          1/1     Running   7 (14m ago)    89m    192.168.1.166   worker02   <none>           <none>
calico-node-dhzr8                          1/1     Running   7 (18m ago)    14d    192.168.1.164   master     <none>           <none>
calico-node-wqv8w                          1/1     Running   1 (14m ago)    27m    192.168.1.165   worker01   <none>           <none>
coredns-5dd5756b68-njtpf                   1/1     Running   7 (18m ago)    14d    10.10.219.88    master     <none>           <none>
coredns-5dd5756b68-pxn8l                   1/1     Running   6 (18m ago)    14d    10.10.219.87    master     <none>           <none>
etcd-master                                1/1     Running   62 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-apiserver-master                      1/1     Running   38 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-controller-manager-master             1/1     Running   42 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-proxy-mgsdr                           1/1     Running   7 (14m ago)    89m    192.168.1.166   worker02   <none>           <none>
kube-proxy-ql576                           1/1     Running   10 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-proxy-zl68t                           1/1     Running   8 (14m ago)    106m   192.168.1.165   worker01   <none>           <none>
kube-scheduler-master                      1/1     Running   41 (18m ago)   14d    192.168.1.164   master     <none>           <none>
metrics-server-98bc7f888-xtdxd             1/1     Running   7 (14m ago)    99m    10.10.5.8       worker01   <none>           <none>

Side note: I also disabled AppArmor (on master and workers):

sudo systemctl stop apparmor && sudo systemctl disable apparmor
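To double-check that AppArmor really stays off, the state can be inspected afterwards (aa-status is part of Ubuntu's AppArmor userspace tools):

    sudo systemctl is-enabled apparmor   # should now report "disabled"
    sudo aa-status                       # lists any profiles still loaded in the kernel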