K3s in Production: Practical Considerations and Outcomes


JR


k3s is viable for lightweight production workloads with proper planning, though tradeoffs exist in scalability and ecosystem support.

Actionable Workflow for k3s Adoption

  1. Assess Workload Requirements:

    • k3s excels for small teams, edge deployments, or stateless apps with predictable resource needs.
    • Avoid if you need advanced networking (e.g., Cilium), complex storage classes, or large node pools (>20 nodes).
  2. Test in Staging:

    • Deploy a non-critical service (e.g., monitoring stack, CI/CD runners) to validate performance and upgrade paths.
    • Use k3s server --datastore-endpoint to test an external datastore (etcd, MySQL, or PostgreSQL) if needed.
  3. Deploy with HA in Mind:

    • For production, run at least 3 server nodes with embedded etcd (--cluster-init), or point every server at an external datastore; k3s supports etcd, MySQL, and PostgreSQL for HA.
    • Join workers with k3s agent, and taint the server nodes so application pods schedule only onto agents:
      k3s agent --server https://<server-ip>:6443 --token <token>
      # on each server, add: --node-taint CriticalAddonsOnly=true:NoExecute
  4. Monitor and Maintain:

    • Enable metrics-server and Prometheus for visibility.
    • Schedule regular snapshot backups with k3s etcd-snapshot save.
  5. Plan Upgrades:

    • Test version upgrades in staging first. Cordon and drain each node with kubectl drain --ignore-daemonsets before upgrading its k3s binary, then uncordon it.
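A minimal sketch of the HA bootstrap described above, assuming the standard get.k3s.io installer; the IPs and token are placeholders to replace for your environment:

```shell
# HA bootstrap sketch (placeholders: <token>, <server-1-ip>).
# The first server initializes an embedded etcd cluster; the taint keeps
# application pods off the control plane.
curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --node-taint CriticalAddonsOnly=true:NoExecute

# Servers 2 and 3 join the first to form the 3-node control plane.
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://<server-1-ip>:6443 \
  --token <token> \
  --node-taint CriticalAddonsOnly=true:NoExecute

# Workers join as agents.
curl -sfL https://get.k3s.io | sh -s - agent \
  --server https://<server-1-ip>:6443 \
  --token <token>

# On-demand etcd snapshot (cron-based snapshots can also be configured).
k3s etcd-snapshot save --name manual-snapshot
```

These commands require real hosts to run against, but the shape of the sequence is what matters: one initializing server, joining servers, then agents.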

Policy Example: Resource Limits

Enforce resource constraints to prevent noisy neighbors. A LimitRange gives every container in the namespace default requests and limits (the values below are illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: production
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi

A node-local StorageClass is often defined alongside, for workloads pinned to local disks:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
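Assuming the manifests above are saved to a file (policy.yaml is an illustrative name), applying and inspecting them takes three commands:

```shell
# Apply everything in the file and review what landed in the namespace.
kubectl apply -f policy.yaml
kubectl get namespace production
kubectl describe namespace production
```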

Tooling

  • k3s: Single lightweight binary with bundled containerd (Docker can be substituted via the --docker flag).
  • Helm: For deploying apps (use Helm 3, which no longer requires Tiller).
  • Prometheus/Grafana: Metrics collection, installed via helm install prometheus bitnami/prometheus.
  • Terraform: For provisioning nodes (e.g., AWS EC2 or on-prem VMs).
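The Prometheus install mentioned above runs through plain Helm 3 against the k3s kubeconfig; the Bitnami repo URL is the standard one, though chart names can change over time:

```shell
# Point Helm at the k3s kubeconfig (default path on k3s servers).
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# Add the Bitnami repo and install Prometheus into its own namespace.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install prometheus bitnami/prometheus \
  --namespace monitoring --create-namespace
```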

Tradeoffs and Caveats

  • Scalability: k3s struggles beyond 20 nodes; consider Rancher or upstream K8s for larger clusters.
  • Ecosystem Gaps: Some CRDs or operators (e.g., Istio, ArgoCD) may lack testing on k3s.
  • HA Complexity: The default embedded SQLite datastore is single-server only and not suitable for production HA; running embedded or external etcd instead adds operational overhead.

Troubleshooting Common Issues

  • Etcd Performance:

    • Symptoms: High latency, API server timeouts.
    • Fix: Use SSD-backed storage, ensure etcd nodes are isolated from workload traffic.
  • Node Registration Failures:

    • Check journalctl -u k3s-agent for token mismatches or network issues.
    • Validate firewall rules allow 6443/tcp between servers and agents.
  • Pod Evictions:

    • Cause: Resource starvation (common in default settings).
    • Fix: Set --kubelet-arg="eviction-hard=memory.available<5%,nodefs.available<10%" on each node (servers and agents) so the kubelet evicts pods before the node itself becomes unresponsive.
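The checks above boil down to a handful of commands on the affected node; the kubectl steps assume metrics-server is enabled, as suggested earlier:

```shell
# Control-plane logs: servers run the k3s unit.
journalctl -u k3s -n 100 --no-pager
# Agent logs: token mismatches and registration errors show up here.
journalctl -u k3s-agent -n 100 --no-pager

# Verify the API port is listening / reachable between servers and agents.
ss -tlnp | grep 6443

# Spot resource starvation before it triggers evictions.
kubectl top nodes
kubectl get events -A --field-selector reason=Evicted
```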

Conclusion

k3s works for teams needing a lean, fast setup but requires deliberate planning around its limitations. Prioritize monitoring, backups, and upgrade testing to avoid outages. If your needs grow beyond 20 nodes or require advanced features, migrate to upstream Kubernetes early.

Source thread: Is anyone else using k3s in production and happy about it?
