Choosing Master and Worker Nodes for Production Kubernetes

Use dedicated, high-availability master nodes with isolated resources and standardized worker nodes sized for workload demands.

JR

Master Nodes: Stability Over Convenience

Master (control plane) nodes run the API server, etcd, the scheduler, and the controller manager, and they need dedicated resources. Avoid co-locating application workloads with them. This isn't theoretical: I've seen clusters destabilized by noisy neighbors during peak loads.

Action Steps:

  1. Deploy at least three master nodes for HA (odd number for etcd quorum).
  2. Use static IPs and FQDNs for all master components.
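  3. Isolate master traffic with firewall rules or security groups (e.g., expose only the API server port, 6443 by default, outside the control plane network). Kubernetes NetworkPolicy objects only govern pod traffic, so they don't protect the nodes themselves.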
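
Why an odd number? etcd needs a majority of members to agree before it commits a write, so an even member count adds cost without adding fault tolerance. A minimal sketch of that arithmetic (the function names are mine, not part of any etcd tooling):

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting nodes."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    # Compare cluster sizes: 3 and 4 members both tolerate a single failure.
    for n in range(1, 8):
        print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Three and four members both survive exactly one failure, which is why the usual step up from three masters is five.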

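Policy Example:

# kubeadm InitConfiguration fragment: registers the node with the control plane taint
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  taints:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule

This taint keeps ordinary workloads off the masters. kubeadm sets it on control plane nodes by default; on an existing node, add it with kubectl taint nodes <node-name> node-role.kubernetes.io/control-plane:NoSchedule. Older clusters use the legacy node-role.kubernetes.io/master key.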

Tooling:

  • kubeadm: For bootstrapping self-managed clusters with clear separation.
  • Managed Kubernetes (AWS EKS, GCP GKE, Azure AKS): The provider runs the masters as a managed service (tradeoff: less control).
  • Prometheus + Grafana: Monitor master component health (etcd latency, API server errors).

Tradeoff: Dedicated masters increase cost but reduce blast radius during failures. Managed services reduce operational burden but may limit customization.

Troubleshooting:

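  • etcd Issues: Check logs for repeated leader elections or network partitions. Use etcdctl endpoint health and etcdctl endpoint status to verify cluster health.
  • API Server Downtime: Rotate certificates proactively; expired certs have tanked clusters during peak hours. On kubeadm clusters, kubeadm certs check-expiration lists the expiry dates.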
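
To make "proactively" concrete, here is a rough sketch that turns a certificate's notAfter string (the format openssl x509 -enddate prints) into days remaining. The 30-day threshold and both function names are assumptions for illustration, not a Kubernetes default.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires.

    `not_after` uses the OpenSSL text format, e.g. "Jan  5 09:34:43 2018 GMT".
    `now` is seconds since the epoch; defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_rotation(not_after: str, threshold_days: float = 30,
                   now: Optional[float] = None) -> bool:
    """True when the certificate expires within `threshold_days` (assumed policy)."""
    return days_until_expiry(not_after, now) <= threshold_days
```

Wire something like this into whatever alerting you already run, so rotation happens on a calendar rather than during an outage.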

Worker Nodes: Fit for Purpose

Worker nodes run your apps. Size them for the actual workloads, not theoretical maxima.

Action Steps:

  1. Profile application resource usage (CPU, memory, storage I/O).
  2. Use node pools for different workloads (e.g., GPU nodes for ML, standard nodes for web apps).
  3. Enable auto-scaling but set realistic bounds (too aggressive = thrash, too conservative = waste).
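
Once you have profiled requests, a back-of-the-envelope estimate shows how many pods fit on a node and how many nodes a deployment needs. This is a simplified sketch: it only considers CPU and memory requests plus the kubelet's default limit of 110 pods per node, and every number in the usage note is hypothetical.

```python
import math


def pods_per_node(alloc_cpu_m: int, alloc_mem_mi: int,
                  pod_cpu_m: int, pod_mem_mi: int,
                  max_pods: int = 110) -> int:
    """Pods that fit on one node, limited by CPU, memory, or the pod cap.

    CPU is in millicores and memory in MiB; `alloc_*` is the node's
    allocatable capacity (what is left after system reservations).
    """
    by_cpu = alloc_cpu_m // pod_cpu_m
    by_mem = alloc_mem_mi // pod_mem_mi
    return min(by_cpu, by_mem, max_pods)


def nodes_needed(replicas: int, per_node: int) -> int:
    """Nodes required to place `replicas` pods, rounding up."""
    if per_node <= 0:
        raise ValueError("pod does not fit on this node size")
    return math.ceil(replicas / per_node)
```

For example, a node with 3800m CPU and 15360Mi allocatable fits seven pods that request 500m and 1024Mi each (CPU is the limit), so ten replicas need two nodes.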

Policy Example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
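
To see what this policy does at runtime, here is a simplified sketch of the replica calculation the HPA documentation describes: desired = ceil(current replicas × current utilization / target utilization), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. It ignores stabilization windows, missing metrics, and not-ready pods, and the function name is mine. Utilization is measured against each pod's CPU request, so the target Deployment needs requests set.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 80,
                     min_replicas: int = 3, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA scaling decision for a CPU utilization target."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_utilization / target_utilization)
    # Never scale outside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With the manifest's values, four replicas averaging 120% CPU scale to six, while four replicas at 85% stay put because they are within tolerance.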

Tooling:

  • Karpenter or Cluster Autoscaler: Dynamically adjust node count based on pod requirements.
  • Load Testing: Use kube-burner to simulate workloads and validate node performance.

Tradeoff: Over-provisioning workers wastes money; under-provisioning causes evictions and OOM kills. Balance with real metrics.

Troubleshooting:

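  • Node Not Ready: Check the cloud provider console for stopped/terminated instances, then check the node's conditions with kubectl describe node <node-name> and the kubelet logs on the node.
  • Pod Schedule Failures: Run kubectl describe pod <pod-name> to read the scheduler's events, and kubectl describe nodes to inspect resource pressure or taints.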

Final Checklist

  • Masters: HA, isolated, monitored.
  • Workers: Right-sized, auto-scaling, node pools for affinity/anti-affinity needs.
  • Drain nodes during upgrades with kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data.

No single stack fits all, but these patterns have kept clusters stable across fleets of 10k+ nodes. Adjust based on your team's capacity and workload reality.

Source thread: What do you use for Master and Workers?
