Common Kubernetes Production Failures and Fixes

Kubernetes production failures often stem from resource exhaustion, misconfigurations, and network bottlenecks, requiring proactive monitoring and remediation.

Diagnosis: Where Things Break

  1. Resource Starvation

    • IP exhaustion (e.g., subnet limits, ENI allocation issues)
    • CPU/Memory pressure causing evictions
    • Disk full errors from unbounded logs or images
  2. Network Misconfigurations

    • Subnet CIDR ranges too small for node/pod growth
    • Firewall rules blocking necessary ports (e.g., 6443, 2379-2380)
    • Incorrect DNS configurations breaking service discovery
  3. Node Instability

    • Tainted nodes not scheduling pods
    • Kernel panics or system container failures
    • Inadequate node autoscaling thresholds
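
Before jumping into repairs, a quick triage pass covers all three failure classes. The commands below are plain kubectl; kubectl top assumes metrics-server is installed:

      # Node readiness and pressure conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
      kubectl get nodes
      kubectl describe nodes | grep -A 6 Conditions

      # Live CPU and memory usage per node (requires metrics-server)
      kubectl top nodes

      # Recent cluster events, newest last: evictions, failed scheduling, image-pull errors
      kubectl get events -A --sort-by=.lastTimestamp | tail -n 30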

Repair Steps: What to Do When It Breaks

  1. Free Up Resources

    • Drain nodes with kubectl drain <node> --ignore-daemonsets --delete-emptydir-data (the flag that replaced --delete-local-data)
    • Scale down non-critical workloads to reclaim IPs/CPU
    • For IP exhaustion in EKS: disable warm ENI allocation in favor of small warm-IP targets on the VPC CNI (aws-node) DaemonSet; the values below are illustrative, so tune them to your pod density:
      kubectl set env daemonset/aws-node -n kube-system WARM_IP_TARGET=5 MINIMUM_IP_TARGET=10
      
  2. Expand Subnet or Migrate

    • Create a new subnet with a larger CIDR block (see the sketch after this list)
    • Update node group configurations to use new subnet
    • Open firewall rules for cross-subnet traffic
  3. Fix Node Issues

    • Reboot nodes via the cloud provider console, or with an OS-level reboot operator such as Kured (kubectl has no built-in reboot command)
    • Remove taints blocking scheduling (the trailing hyphen removes every taint with that key):
      kubectl taint nodes <node> node.kubernetes.io/not-ready-  
      
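For step 2, the subnet work itself happens at the cloud layer. A minimal AWS sketch, assuming an EKS cluster named prod and placeholder VPC/subnet IDs and CIDR (substitute your own; the cluster tag lets Kubernetes and the load balancer controllers discover the new subnet):

      # Carve a larger subnet out of unused VPC address space (values are illustrative)
      aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
        --cidr-block 10.0.64.0/22 --availability-zone us-east-1a

      # Tag it for cluster discovery, then point the node group or launch template at it
      aws ec2 create-tags --resources subnet-0abc1234def567890 \
        --tags Key=kubernetes.io/cluster/prod,Value=shared Key=kubernetes.io/role/internal-elb,Value=1
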

Prevention: Stopping Fires Before They Start

  1. Monitor and Alert

    • Track kubelet and node metrics (e.g., kubelet_running_pods, and kube_node_status_allocatable from kube-state-metrics)
    • Alert on node resource usage (CPU >80%, memory >90%); sample Prometheus rules follow this list
    • Watch for Evicted pod events:
      kubectl get events --field-selector reason=Evicted  
      
  2. Policy Example: Subnet Sizing

    • Enforce a minimum /22 CIDR (1,024 addresses) for worker subnets in AWS
    • Use OpenShift’s Cluster Network Operator to validate CIDR ranges
  3. Capacity Planning

    • Estimate how many nodes a subnet can support: (usable IPs - reserved IPs) / (pods per node + 1 for the node itself). A /24 in AWS has 251 usable IPs, so at ~30 pods per node it holds roughly 8 nodes
    • Schedule regular capacity reviews and keep a ~30% buffer for growth
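
On the alerting side, a sketch of Prometheus rules matching the thresholds above, assuming node_exporter metrics are scraped (alert names, durations, and severities are illustrative):

      groups:
        - name: node-capacity
          rules:
            - alert: NodeHighCPU
              expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
              for: 15m
              labels: {severity: warning}
              annotations: {summary: "Node CPU above 80% for 15 minutes"}
            - alert: NodeHighMemory
              expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
              for: 10m
              labels: {severity: critical}
              annotations: {summary: "Node memory above 90% for 10 minutes"}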

Tooling: What’s in My Kitbag

  • kubectl: kubectl describe node <node> for resource limits, kubectl logs -f <pod> for container issues
  • Cloud Tools: AWS VPC Flow Logs for network troubleshooting, GCP’s Cloud Console for node health
  • Observability: Prometheus + Grafana for metrics, Weave Scope for network visibility
  • Policy Enforcement: OPA Gatekeeper to block undersized subnets
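
For the VPC Flow Logs entry above, a minimal sketch of turning them on for a VPC (the IDs, log group name, and IAM role ARN are placeholders you create separately):

      # Capture rejected traffic VPC-wide into CloudWatch Logs for firewall and CIDR debugging
      aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-0123456789abcdef0 \
        --traffic-type REJECT --log-group-name k8s-vpc-flow-logs \
        --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role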

Tradeoffs and Caveats

  • Disabling Warm ENI: Frees IPs but increases pod startup latency by ~200-500ms. Use only when IP pressure is critical.
  • Over-Provisioning: Larger subnets increase IP waste; balance growth vs. cost.
  • Node Reboots: Mitigate downtime with rolling updates but expect 5-10 minutes per node.

Troubleshooting Common Failures

  1. IP Exhaustion

    • Check per-instance-type ENI and IP limits (MaximumNetworkInterfaces, Ipv4AddressesPerInterface): aws ec2 describe-instance-types --instance-types <type> --query "InstanceTypes[].NetworkInfo"
    • Verify subnet utilization (watch AvailableIpAddressCount): aws ec2 describe-subnets --filters "Name=tag:Name,Values=<subnet-name>" --query "Subnets[].AvailableIpAddressCount"
  2. Node Evictions

    • Check node conditions:
      kubectl describe node <node> | grep -A 5 Conditions  
      
    • Disk pressure shows up as DiskPressure: True in the conditions above; compare workload requests against what the node can offer: kubectl describe node <node> | grep -A 5 Allocatable
  3. DNS Failures

    • Test cluster DNS from a throwaway pod (the CoreDNS image ships without nslookup): kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default (a fuller check follows this list)
    • Check kube-proxy logs for IPVS errors:
      kubectl logs -n kube-system <kube-proxy-pod>  
      
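If the in-pod lookup fails, check CoreDNS itself next. These commands assume the standard k8s-app=kube-dns labels that most distributions use:

      # Are the CoreDNS replicas running, and are they logging SERVFAILs or upstream timeouts?
      kubectl get pods -n kube-system -l k8s-app=kube-dns
      kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

      # No endpoints behind the kube-dns Service means nothing is answering queries
      kubectl get svc/kube-dns endpoints/kube-dns -n kube-system

      # External names failing while cluster names resolve usually points at blocked forwarders
      kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup example.com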

When in doubt, start small: fix one node, validate, then scale. Document every workaround—because production amnesia kills teams.

Source thread: What Actually Goes Wrong in Kubernetes Production?
