
Lab 254 - Cluster setup with Talos & Cilium

I want to rebuild my current homelab cluster using some better practices, and thought it might be a good idea to document this journey through these posts. Lab 254 is a project where I’ll keep things as automated and simple as possible. The name itself comes from the VLAN range in which I’ll host the cluster. My aim is to keep these posts rather short and to the point, but we’ll see how long I can stick to that.

So, to sketch the current situation: I have a homelab running on Talos with Flux, taskfiles, talhelper, … I’m pretty familiar with these tools, but I want to simplify that setup into something that doesn’t take much work to maintain. Along with that, I want to reduce the number of “supporting” workloads (think load balancers, ingress controllers, …), moving from a Cilium + NGINX Ingress + MetalLB setup to a full Cilium setup.

Setting up Talos with talhelper

I promised to keep things to the point, so let’s dive right in. I started by creating a repository with the following directory structure:

├── .gitignore
├── .taskfiles/
├── README.md
├── talos/
│   ├── clusterconfig/
│   │   └── .gitignore
│   ├── patches/
│   │   ├── controller/
│   │   └── global/
│   └── talconfig.yaml
└── Taskfile.yaml

This repository’s aim is to store everything required to set up the cluster and run workloads in it. Within the talos/ directory, I’ll keep everything Talos needs to run. The base of the Talos setup is the talconfig.yaml file, in which the cluster’s configuration is defined. I’ve included a sample below, which you should update to match your environment.

# yaml-language-server: $schema=https://raw.githubusercontent.com/budimanjojo/talhelper/master/pkg/config/schemas/talconfig.json
---
talosVersion: v1.11.0
kubernetesVersion: 1.34.0

clusterName: "homelab"
endpoint: https://10.254.254.99:6443

clusterPodNets: 
  - "10.244.0.0/16"

additionalApiServerCertSans: &sans
  - &kubeApiIP "10.254.254.99"
  - 127.0.0.1
additionalMachineCertSans: *sans 

# Disable built-in Flannel to use Cilium
cniConfig:
  name: none

nodes:
  # Duplicate the config below for every control plane node you have
  - hostname: "control-01"
    ipAddress: "10.254.254.100"
    installDiskSelector:
      size: '> 4GB'
    controlPlane: true
    networkInterfaces:
      - deviceSelector:
          hardwareAddr: "aa:bb:cc:dd:ee:ff"
        dhcp: false
        addresses:
          - "10.254.254.100/24"
        routes: &routes
          - network: 0.0.0.0/0
            gateway: "10.254.254.1"
        mtu: &mtu 1500
        vip: &vip
          ip: *kubeApiIP
  # Duplicate the config below for every worker node you have
  - hostname: "worker-01"
    ipAddress: "10.254.254.110"
    installDiskSelector:
      size: "> 4GB"
    controlPlane: false
    networkInterfaces:
      - deviceSelector:
          hardwareAddr: "aa:bb:cc:dd:ee:ff"
        dhcp: false
        addresses:
          - "10.254.254.110/24"
        routes: *routes
        mtu: *mtu
patches:
  - |-
    machine:
      network:
        nameservers:
          - 8.8.8.8
          - 8.8.4.4
  - "@./patches/global/cluster-discovery.yaml"
  - "@./patches/global/hostdns.yaml"
  - "@./patches/global/kubelet.yaml"
  - "@./patches/global/cni.yaml"
controlPlane:
  patches:
    - "@./patches/controller/api-access.yaml"

As you can see, there are references to patch files at the bottom of the file. These are used to make small changes to Talos’s default settings; if you’re curious about which changes I make, you can find them in my repository.
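
As an illustration, such a patch can be as small as disabling the built-in kube-proxy, since Cilium’s kube-proxy replacement (enabled later during the Helm install) takes over that role. This is a hypothetical example; the exact contents of my patches live in the repository:

# Hypothetical patch: disable kube-proxy in favour of Cilium's replacement
cluster:
  proxy:
    disabled: true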

Talos the GitOps way with talhelper

Talhelper is a tool which can be used to deploy Talos “the GitOps way”. It takes the talconfig.yaml file described in the previous section and generates the Talos client configuration and the per-node machine configurations, which can then be applied to the nodes. To generate the configuration files, run talhelper genconfig; talhelper gencommand apply --node <IP> then prints the matching talosctl command, which you pipe into a shell to actually apply it.
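
Concretely, the manual flow looks roughly like this (using the control plane IP from the sample talconfig.yaml above):

# Generate the Talos client config and per-node machine configs into clusterconfig/
talhelper genconfig

# gencommand only prints the talosctl command, so pipe it into a shell to run it
talhelper gencommand apply --node 10.254.254.100 --extra-flags="--insecure" | bash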

But remembering those commands was a bit too much of a hassle for me, so I created a Taskfile.yaml, a .taskfiles/talos.yaml and a .taskfiles/bootstrap.yaml to make it easier to deploy a cluster.

Taskfile.yaml:

# yaml-language-server: $schema=https://taskfile.dev/schema.json
version: "3"
includes:
  talos: .taskfiles/talos.yaml
  bootstrap: .taskfiles/bootstrap.yaml
vars:
  TALOS_DIR: "{{ .ROOT_DIR }}/talos"
  TALOS_CONFIG: "{{ .TALOS_DIR }}/clusterconfig/talosconfig"
  KUBE_CONFIG: "{{ .TALOS_DIR }}/clusterconfig/kubeconfig"
tasks:
  default:
    silent: true
    cmds:
      - task -l
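
Because of the includes block, the tasks defined in the two files below end up namespaced as talos:* and bootstrap:*, which is the naming you’ll see in the commands later on.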

talos.yaml:

# yaml-language-server: $schema=https://taskfile.dev/schema.json
version: "3"
tasks:
  generate-config:
    desc: Generate Talos node configuration
    dir: "{{ .TALOS_DIR }}"
    cmd: talhelper genconfig
    preconditions:
      - test -f {{.TALOS_DIR}}/talconfig.yaml
      - which talhelper
  apply-config:
    desc: Apply Talos node configuration to a node
    dir: "{{ .TALOS_DIR }}"
    cmd: talhelper gencommand apply --node {{ .IP }} --extra-flags="--insecure" | bash
    requires:
      vars:
        - IP
    preconditions:
      - test -f {{ .TALOS_CONFIG }}
      - which talhelper
      - which talosctl
  fetch-kubeconfig:
    desc: Fetches the cluster's kubeconfig file
    dir: "{{ .TALOS_DIR }}"
    cmd: until talhelper gencommand kubeconfig --extra-flags="{{ .TALOS_DIR }}/clusterconfig --force" | bash; do sleep 10; done
    preconditions:
      - which talhelper
      - which talosctl
  reset-cluster:
    desc: Resets nodes in cluster
    dir: "{{ .TALOS_DIR }}"
    prompt: This will destroy your cluster and reset the nodes back to maintenance mode... continue?
    cmd: talhelper gencommand reset --extra-flags="--reboot --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL --graceful=false --wait=false" | bash
    preconditions:
      - which talhelper
      - which talosctl

bootstrap.yaml:

# yaml-language-server: $schema=https://taskfile.dev/schema.json
version: "3"
tasks:
  talos:
    desc: Bootstrap the Talos cluster
    dir: "{{ .TALOS_DIR }}"
    cmds:
      - until talhelper gencommand bootstrap | bash; do sleep 10; done
    preconditions:
      - test -f {{ .TALOS_DIR }}/talconfig.yaml 
      - which talhelper
      - which talosctl
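
The until ... do sleep 10; done loops in fetch-kubeconfig and the bootstrap task are there because those commands simply fail until the Talos API on the nodes is reachable; retrying every ten seconds means I don’t have to time them against the reboots myself.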

Setting up the cluster now goes as follows:

  1. Update the talos/talconfig.yaml file to match your environment
  2. Run task talos:generate-config to generate the Talos machine configurations
  3. Run task talos:apply-config IP=<MACHINE_IP> for each machine you want to set up
  4. Wait for the machines to reboot…
  5. Run task bootstrap:talos to bootstrap your Talos cluster
  6. Run task talos:fetch-kubeconfig to retrieve the cluster’s kubeconfig file
  7. From the root of the repo, set the retrieved kubeconfig file as the one kubectl should use: export KUBECONFIG=$(pwd)/talos/clusterconfig/kubeconfig
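
For reference, a full run from an edited talconfig.yaml to a usable kubeconfig boils down to this (node IPs taken from the sample configuration above):

task talos:generate-config
task talos:apply-config IP=10.254.254.100   # repeat for every node
task bootstrap:talos
task talos:fetch-kubeconfig
export KUBECONFIG=$(pwd)/talos/clusterconfig/kubeconfig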

Installing Cilium as CNI

Previously, running Cilium in Talos clusters was a bit of a pain since the L2 load balancing had some issues, but those have now been resolved, which is why I finally got rid of MetalLB. Installing Cilium is pretty simple since you only need to execute two commands:

  1. Add a label to the kube-system namespace so that pod security admission allows the privileged Cilium pods (not the best solution, but it does the job for now):
kubectl label ns kube-system pod-security.kubernetes.io/enforce=privileged
  2. Install Cilium using helm:
helm install cilium --namespace kube-system cilium/cilium \
--set ipam.mode=kubernetes \
--set kubeProxyReplacement=true \
--set operator.replicas=1 \
--set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
--set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
--set cgroup.autoMount.enabled=false \
--set cgroup.hostRoot=/sys/fs/cgroup \
--set l2announcements.enabled=true \
--set externalIPs.enabled=true \
--set ingressController.enabled=true \
--set ingressController.default=true \
--set k8sServiceHost=localhost \
--set k8sServicePort=7445
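
Note that the cilium/cilium chart reference assumes the Cilium Helm repository has already been added on your machine; if that’s not the case, add it first:

helm repo add cilium https://helm.cilium.io
helm repo update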

Just like with the Talos commands, this is a bit too much effort for me to remember and execute every time I rebuild my cluster (which happened a lot while troubleshooting), so I added a new task for executing these commands. Besides the “Bootstrap the Talos cluster” task, there is now also a “Bootstrap network” task in the bootstrap.yaml file:

  network:
    desc: Bootstraps the Kubernetes cluster's network
    cmds:
      - kubectl label ns kube-system pod-security.kubernetes.io/enforce=privileged
      - helm --kubeconfig {{ .KUBE_CONFIG }} install cilium --namespace kube-system cilium/cilium --set ipam.mode=kubernetes --set kubeProxyReplacement=true --set operator.replicas=1 --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" --set cgroup.autoMount.enabled=false --set cgroup.hostRoot=/sys/fs/cgroup --set l2announcements.enabled=true --set externalIPs.enabled=true --set ingressController.enabled=true --set ingressController.default=true --set k8sServiceHost=localhost --set k8sServicePort=7445
    preconditions:
      - test -f {{ .KUBE_CONFIG }}
      - which helm
      - which kubectl

So, once the Talos cluster has been set up, simply executing task bootstrap:network installs the Cilium CNI, after which pods can be scheduled.
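
To verify the installation, a quick manual check (not part of any taskfile) is usually enough: the Cilium agent pods should be Running and the nodes should report Ready once the CNI is up. I’m assuming here that the agent pods carry the standard k8s-app=cilium label:

kubectl -n kube-system get pods -l k8s-app=cilium
kubectl get nodes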

Next up

Now that the cluster is up and running using Talos and Cilium, the next step is to get Flux set up to automatically install some workloads on the cluster.