I want to rebuild my current homelab cluster using some better practices, and thought it might be a good idea to document this journey through these posts. Lab 254 is a project where I’ll keep things as automated and simple as possible. The name itself comes from the VLAN range in which I’ll host the cluster. My aim is to keep these posts rather short and to the point, but we’ll see how long I can stick to that.
So, to sketch the current situation: I have a homelab running on Talos with Flux, taskfiles, talhelper, … I’m pretty familiar with these tools, but I want to simplify that setup into something that takes less work to maintain. Along with that, I want to reduce the number of “supporting” workloads (think load balancers, ingress controllers, …), moving from a Cilium + NGINX Ingress + MetalLB setup to a full Cilium setup.
Setting up Talos with talhelper
I promised to keep things to the point, so let’s dive right in. I started by creating a repository with the following directory structure:
├── .gitignore
├── .taskfiles/
├── README.md
├── talos/
│   ├── clusterconfig/
│   │   ├── .gitignore
│   ├── patches/
│   │   ├── controller/
│   │   └── global/
│   └── talconfig.yaml
└── Taskfile.yaml
This repository’s aim is to store everything required to set up the cluster and run workloads in it. Within the talos/ directory, I’ll store everything for Talos to run. The base of the Talos setup is the talconfig.yaml file, in which the cluster’s configuration is defined. I’ve included a sample below, which you should update to match your environment.
# yaml-language-server: $schema=https://raw.githubusercontent.com/budimanjojo/talhelper/master/pkg/config/schemas/talconfig.json
---
talosVersion: v1.11.0
kubernetesVersion: 1.34.0
clusterName: "homelab"
endpoint: https://10.254.254.99:6443
clusterPodNets:
  - "10.244.0.0/16"
additionalApiServerCertSans: &sans
  - &kubeApiIP "10.254.254.99"
  - 127.0.0.1
additionalMachineCertSans: *sans
# Disable built-in Flannel to use Cilium
cniConfig:
  name: none
nodes:
  # Duplicate the config below for every control plane node you have
  - hostname: "control-01"
    ipAddress: "10.254.254.100"
    installDiskSelector:
      size: '> 4GB'
    controlPlane: true
    networkInterfaces:
      - deviceSelector:
          hardwareAddr: "aa:bb:cc:dd:ee:ff"
        dhcp: false
        addresses:
          - "10.254.254.100/24"
        routes: &routes
          - network: 0.0.0.0/0
            gateway: "10.254.254.1"
        mtu: &mtu 1500
        vip: &vip
          ip: *kubeApiIP
  # Duplicate the config below for every worker node you have
  - hostname: "worker-01"
    ipAddress: "10.254.254.110"
    installDiskSelector:
      size: "> 4GB"
    controlPlane: false
    networkInterfaces:
      - deviceSelector:
          hardwareAddr: "aa:bb:cc:dd:ee:ff"
        dhcp: false
        addresses:
          - "10.254.254.110/24"
        routes: *routes
        mtu: *mtu
patches:
  - |-
    machine:
      network:
        nameservers:
          - 8.8.8.8
          - 8.8.4.4
  - "@./patches/global/cluster-discovery.yaml"
  - "@./patches/global/hostdns.yaml"
  - "@./patches/global/kubelet.yaml"
  - "@./patches/global/cni.yaml"
controlPlane:
  patches:
    - "@./patches/controller/api-access.yaml"
As you can see, there are references to patch files at the bottom of the file. These are used to make small changes to Talos’s default settings; if you’re curious which changes I make, you can find them in my repository.
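To give an idea of what such a patch looks like: each one is just a partial Talos machine config that gets merged into the generated node configs. A host DNS patch, for example, could look like the sketch below (illustrative only; the actual patches/global/hostdns.yaml in my repo may differ):

# talos/patches/global/hostdns.yaml (illustrative sketch)
machine:
  features:
    hostDNS:
      enabled: true                # run a caching DNS resolver on the host
      forwardKubeDNSToHost: true   # forward cluster DNS queries to that resolver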
Talos the GitOps way with talhelper
Talhelper is a tool for deploying Talos “the GitOps way”. It takes the talconfig.yaml file described in the previous section and generates the per-node machine configurations (along with a talosconfig) that can then be applied to the nodes. To generate the configuration files, run talhelper genconfig; they can then be applied by piping the output of talhelper gencommand apply --node <IP> into a shell.
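For reference, doing that by hand looks roughly like this (using one of the node IPs from the talconfig above; gencommand only prints the talosctl command, so it gets piped into a shell):

# Generate per-node machine configs plus a talosconfig into talos/clusterconfig/
talhelper genconfig

# Apply the generated config to a node still in maintenance mode (hence --insecure)
talhelper gencommand apply --node 10.254.254.100 --extra-flags="--insecure" | bash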
But remembering those commands was a bit too much of a hassle for me, so I created a Taskfile.yaml, a .taskfiles/talos.yaml and a .taskfiles/bootstrap.yaml to make it easier to deploy a cluster.
Taskfile.yaml:
# yaml-language-server: $schema=https://taskfile.dev/schema.json
version: "3"

includes:
  talos: .taskfiles/talos.yaml
  bootstrap: .taskfiles/bootstrap.yaml

vars:
  TALOS_DIR: "{{ .ROOT_DIR }}/talos"
  TALOS_CONFIG: "{{ .TALOS_DIR }}/clusterconfig/talosconfig"
  KUBE_CONFIG: "{{ .TALOS_DIR }}/clusterconfig/kubeconfig"

tasks:
  default:
    silent: true
    cmds:
      - task -l
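Because of the includes block, the tasks defined in those two files are namespaced by their include key, so they are invoked as, for example:

task talos:generate-config
task bootstrap:talos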
.taskfiles/talos.yaml:
# yaml-language-server: $schema=https://taskfile.dev/schema.json
version: "3"

tasks:
  generate-config:
    desc: Generate Talos node configuration
    dir: "{{ .TALOS_DIR }}"
    cmd: talhelper genconfig
    preconditions:
      - test -f {{ .TALOS_DIR }}/talconfig.yaml
      - which talhelper

  apply-config:
    desc: Apply Talos node configuration to a node
    dir: "{{ .TALOS_DIR }}"
    cmd: talhelper gencommand apply --node {{ .IP }} --extra-flags="--insecure" | bash
    requires:
      vars:
        - IP
    preconditions:
      - test -f {{ .TALOS_CONFIG }}
      - which talhelper
      - which talosctl

  fetch-kubeconfig:
    desc: Fetches the cluster's kubeconfig file
    dir: "{{ .TALOS_DIR }}"
    cmd: until talhelper gencommand kubeconfig --extra-flags="{{ .TALOS_DIR }}/clusterconfig --force" | bash; do sleep 10; done
    preconditions:
      - which talhelper
      - which talosctl

  reset-cluster:
    desc: Resets nodes in cluster
    dir: "{{ .TALOS_DIR }}"
    prompt: This will destroy your cluster and reset the nodes back to maintenance mode... continue?
    cmd: talhelper gencommand reset --extra-flags="--reboot --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL --graceful=false --wait=false" | bash
    preconditions:
      - which talhelper
      - which talosctl
.taskfiles/bootstrap.yaml:
# yaml-language-server: $schema=https://taskfile.dev/schema.json
version: "3"

tasks:
  talos:
    desc: Bootstrap the Talos cluster
    dir: "{{ .TALOS_DIR }}"
    cmds:
      - until talhelper gencommand bootstrap | bash; do sleep 10; done
    preconditions:
      - test -f {{ .TALOS_DIR }}/talconfig.yaml
      - which talhelper
      - which talosctl
Setting up the cluster now goes as follows:
- Update the talos/talconfig.yaml file to match your environment
- Run task talos:generate-config to generate the Talos machine configurations
- Run task talos:apply-config IP=<MACHINE_IP> for each machine you want to set up
- Wait for the machines to reboot…
- Run task bootstrap:talos to bootstrap your Talos cluster
- Run task talos:fetch-kubeconfig to retrieve the cluster’s kubeconfig file
- From the root of the repo, set the retrieved kubeconfig file as the one kubectl should use: export KUBECONFIG=$(pwd)/talos/clusterconfig/kubeconfig
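At this point the cluster is up, but because we disabled the built-in CNI in talconfig.yaml, the nodes will report a NotReady status until Cilium is installed in the next section. A quick sanity check:

kubectl get nodes
# Every node should be listed, but will stay NotReady until a CNI is running.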
Installing Cilium as CNI
Previously, running Cilium in Talos clusters was a bit of a pain since the L2 load balancing had some issues, but those have now been resolved, which is why I finally got rid of MetalLB. Installing Cilium is pretty simple since you only need to execute two commands:
- Add a label to the kube-system namespace so Pod Security admission allows Cilium’s privileged pods (not the best solution, but it does the job for now):
kubectl label ns kube-system pod-security.kubernetes.io/enforce=privileged
- Install Cilium using Helm:
helm install cilium --namespace kube-system cilium/cilium \
--set ipam.mode=kubernetes \
--set kubeProxyReplacement=true \
--set operator.replicas=1 \
--set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
--set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
--set cgroup.autoMount.enabled=false \
--set cgroup.hostRoot=/sys/fs/cgroup \
--set l2announcements.enabled=true \
--set externalIPs.enabled=true \
--set ingressController.enabled=true \
--set ingressController.default=true \
--set k8sServiceHost=localhost \
--set k8sServicePort=7445
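Note that this assumes the Cilium Helm repository has already been added locally. If it hasn’t, add it first:

helm repo add cilium https://helm.cilium.io/
helm repo update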
Just like with the Talos commands, this is a bit too much effort for me to remember and execute every time I rebuild my cluster (which happened a lot while troubleshooting), so I added a new task for executing these commands. Besides the “Bootstrap the Talos cluster” task, there is now also a “Bootstrap network” task in the bootstrap.yaml file:
network:
  desc: Bootstraps the Kubernetes cluster's network
  cmds:
    - kubectl label ns kube-system pod-security.kubernetes.io/enforce=privileged
    - >-
      helm --kubeconfig {{ .KUBE_CONFIG }} install cilium --namespace kube-system cilium/cilium
      --set ipam.mode=kubernetes
      --set kubeProxyReplacement=true
      --set operator.replicas=1
      --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}"
      --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}"
      --set cgroup.autoMount.enabled=false
      --set cgroup.hostRoot=/sys/fs/cgroup
      --set l2announcements.enabled=true
      --set externalIPs.enabled=true
      --set ingressController.enabled=true
      --set ingressController.default=true
      --set k8sServiceHost=localhost
      --set k8sServicePort=7445
  preconditions:
    - test -f {{ .KUBE_CONFIG }}
    - which helm
    - which kubectl
So, by just executing task bootstrap:network after the Talos cluster has been set up, the Cilium CNI will be installed, after which pods can be deployed.
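Since l2announcements.enabled=true and externalIPs.enabled=true are set, Cilium can also take over the LoadBalancer duties MetalLB used to handle. That part isn’t covered by the bootstrap task, but as a rough sketch it boils down to an IP pool plus an announcement policy along these lines (the apiVersion and exact fields vary a bit between Cilium releases, and the IP range below is just an example from my VLAN):

# Sketch only: a pool of IPs Cilium may assign to LoadBalancer Services
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: homelab-pool
spec:
  blocks:
    - cidr: 10.254.254.192/26   # example range, pick one that fits your network
---
# Sketch only: announce those IPs via ARP from the nodes' interfaces
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: default-l2-policy
spec:
  loadBalancerIPs: true
  externalIPs: true
  interfaces:
    - ^eth[0-9]+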
Next up
Now that the cluster is up and running using Talos and Cilium, the next step is to get Flux set up to automatically install some workloads on the cluster.