Common Kubernetes Production Failures and Fixes
Kubernetes production failures often stem from resource exhaustion, misconfigurations, and network bottlenecks, requiring proactive monitoring and remediation.
Diagnosis: Where Things Break
- Resource Starvation
  - IP exhaustion (e.g., subnet limits, ENI allocation issues)
  - CPU/memory pressure causing evictions
  - Disk-full errors from unbounded logs or images
- Network Misconfigurations
  - Subnet CIDR ranges too small for node/pod growth
  - Firewall rules blocking required ports (e.g., 6443 for the API server, 2379-2380 for etcd)
  - Incorrect DNS configuration breaking service discovery
- Node Instability
  - Tainted nodes not scheduling pods
  - Kernel panics or system container failures
  - Inadequate node autoscaling thresholds
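Before reaching for fixes, a quick triage pass over these symptoms usually narrows the cause. A minimal sketch using standard kubectl commands; `<node>`-style values elsewhere in this post are placeholders:

```bash
# Node health: look for NotReady, MemoryPressure, DiskPressure, PIDPressure
kubectl get nodes -o wide
kubectl describe nodes | grep -A 8 "Conditions:"

# Pods that cannot be scheduled (often IP or CPU/memory starvation)
kubectl get pods -A --field-selector status.phase=Pending

# Recent evictions and scheduling failures, newest last
kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp
kubectl get events -A --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
```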
Repair Steps: What to Do When It Breaks
- Free Up Resources
  - Drain nodes: `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`
  - Scale down non-critical workloads to reclaim IPs and CPU
  - For IP exhaustion in EKS, shrink the Amazon VPC CNI's warm IP/ENI pool by tuning `WARM_ENI_TARGET`, `WARM_IP_TARGET`, and `MINIMUM_IP_TARGET` on the `aws-node` DaemonSet (see the sketch after this list)
- Expand Subnet or Migrate
  - Create a new subnet with a larger CIDR
  - Update node group configurations to use the new subnet
  - Open firewall rules for cross-subnet traffic
- Fix Node Issues
  - Reboot nodes via the cloud provider console/CLI or an OS-level reboot operator such as Kured; there is no built-in `kubectl reboot` command
  - Remove taints blocking scheduling: `kubectl taint nodes <node> node.kubernetes.io/not-ready:NoSchedule-` (the trailing `-` removes the taint)
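For the EKS warm-pool item above, a hedged sketch of the usual approach: tune the Amazon VPC CNI (the `aws-node` DaemonSet) rather than the kubelet. This assumes the VPC CNI is your pod networking plugin; the values `WARM_IP_TARGET=2` and `MINIMUM_IP_TARGET=10` are illustrative, not recommendations:

```bash
# Reduce how many unused IPs each node pre-allocates (Amazon VPC CNI)
kubectl -n kube-system set env daemonset/aws-node \
  WARM_IP_TARGET=2 MINIMUM_IP_TARGET=10
kubectl -n kube-system rollout status daemonset/aws-node

# Drain a node to reclaim its pod IPs, then return it to service
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node>

# Remove a leftover taint that blocks scheduling (trailing '-' deletes it)
kubectl taint nodes <node> node.kubernetes.io/not-ready:NoSchedule-
```

Shrinking the warm pool trades IP headroom for pod-start latency; that tradeoff is covered in the caveats section below.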
Prevention: Stopping Fires Before They Start
- Monitor and Alert
  - Track kubelet and node metrics (e.g., `kubelet_running_pods`, node allocatable CPU/memory)
  - Alert on node resource usage (CPU >80%, memory >90%); see the alert-rule sketch under Tooling below
  - Watch for `Evicted` pod events: `kubectl get events -A --field-selector reason=Evicted`
- Policy Example: Subnet Sizing
  - Enforce a minimum /22 CIDR for worker subnets in AWS
  - Use OpenShift's Cluster Network Operator to validate cluster network CIDR ranges
- Capacity Planning
  - Estimate how many nodes a subnet can support: (usable subnet IPs − reserved IPs) ÷ IPs consumed per node; a worked example follows below
  - Schedule regular capacity reviews and keep a ~30% buffer for growth
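As a worked example of the sizing arithmetic above, here is a rough check assuming a /22 worker subnet, AWS's five reserved addresses per subnet, and m5.large nodes (3 ENIs × 10 IPv4 addresses each under the VPC CNI). Adjust the constants to your own instance types:

```bash
# Rough subnet capacity estimate; numbers are illustrative, not a sizing policy
SUBNET_IPS=1024          # a /22 provides 1024 addresses
RESERVED=5               # AWS reserves 5 addresses per subnet
IPS_PER_NODE=30          # m5.large: 3 ENIs x 10 IPv4 addresses (node + pods)

USABLE=$((SUBNET_IPS - RESERVED))
MAX_NODES=$((USABLE / IPS_PER_NODE))
# Keep ~30% headroom for growth, per the planning guideline above
PLANNED_NODES=$((MAX_NODES * 70 / 100))
echo "usable IPs: $USABLE, max nodes: $MAX_NODES, plan for: $PLANNED_NODES"
```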
Tooling: What’s in My Kitbag
- kubectl: `kubectl describe node <node>` for capacity, allocatable resources, and conditions; `kubectl logs -f <pod>` for container issues
- Cloud tools: AWS VPC Flow Logs for network troubleshooting, GCP Cloud Console for node health
- Observability: Prometheus + Grafana for metrics, Weave Scope for network visibility
- Policy enforcement: OPA Gatekeeper to block undersized subnets
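For the CPU >80% / memory >90% alerts mentioned under Monitor and Alert, here is a sketch of Prometheus rules. It assumes node_exporter metrics and the prometheus-operator's PrometheusRule CRD are available (e.g., via kube-prometheus-stack); the rule name, namespace, and thresholds are illustrative:

```bash
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-resource-pressure    # illustrative name
  namespace: monitoring           # adjust to your Prometheus namespace
spec:
  groups:
  - name: node-resources
    rules:
    - alert: NodeHighCPU
      expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
      for: 10m
      labels: {severity: warning}
      annotations: {summary: "Node CPU above 80% for 10 minutes"}
    - alert: NodeHighMemory
      expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90
      for: 10m
      labels: {severity: warning}
      annotations: {summary: "Node memory above 90% for 10 minutes"}
EOF
```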
Tradeoffs and Caveats
- Shrinking the warm ENI/IP pool: Frees IPs but increases pod startup latency by ~200-500 ms, since addresses are attached on demand. Use only when IP pressure is critical.
- Over-Provisioning: Larger subnets increase IP waste; balance growth headroom against cost.
- Node Reboots: Mitigate downtime with rolling reboots (see the sketch below), but expect 5-10 minutes per node.
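For the node-reboot caveat, a rolling-reboot sketch on AWS: one node at a time, drain before, uncordon after. It assumes EKS-style `providerID` values and AWS CLI access; on other providers, swap the reboot call for the equivalent:

```bash
# Roll through nodes one at a time; validate workloads before moving on
for NODE in $(kubectl get nodes -o name | cut -d/ -f2); do
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s

  # Map the node to its EC2 instance ID via providerID (aws:///<az>/i-xxxx)
  INSTANCE_ID=$(kubectl get node "$NODE" \
    -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
  aws ec2 reboot-instances --instance-ids "$INSTANCE_ID"

  # Give the node time to actually go down before waiting for Ready again
  sleep 120
  kubectl wait --for=condition=Ready "node/$NODE" --timeout=600s
  kubectl uncordon "$NODE"
done
```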
Troubleshooting Common Failures
- IP Exhaustion
  - Check per-instance ENI and IP limits: `aws ec2 describe-instance-types --instance-types <type> --query 'InstanceTypes[].NetworkInfo.{MaxENIs:MaximumNetworkInterfaces,IPv4PerENI:Ipv4AddressesPerInterface}'`
  - Verify subnet utilization: `aws ec2 describe-subnets --filters "Name=tag:Name,Values=<subnet-name>" --query 'Subnets[].[SubnetId,AvailableIpAddressCount]'`
- Node Evictions
  - Check node conditions (MemoryPressure, DiskPressure, PIDPressure): `kubectl describe node <node> | grep -A 8 Conditions`
  - Check remaining capacity: `kubectl describe node <node> | grep -A 6 Allocatable`
- DNS Failures
  - Test cluster DNS from a throwaway pod (CoreDNS images ship without debugging tools): `kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default.svc.cluster.local`
  - Check kube-proxy logs for IPVS/iptables errors: `kubectl logs -n kube-system <kube-proxy-pod>`
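When the DNS checks above fail, it usually comes down to CoreDNS health or upstream resolution. A short follow-up sketch; the busybox image tag and the `k8s-app=kube-dns` label are assumptions that hold on most clusters, so verify them against your own deployment:

```bash
# Is CoreDNS running, and does its Service have endpoints?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system get endpoints kube-dns

# CoreDNS logs often show upstream timeouts or loop-detection errors
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100

# Resolve an external name from inside the cluster to separate
# cluster-DNS problems from upstream/VPC resolver problems
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never \
  -- nslookup example.com
```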
When in doubt, start small: fix one node, validate, then scale. Document every workaround—because production amnesia kills teams.
Source thread: What Actually Goes Wrong in Kubernetes Production?
