Common Kubernetes Production Failures and Fixes

Kubernetes production failures often stem from resource exhaustion, misconfigurations, and network bottlenecks, requiring proactive monitoring and remediation.

Diagnosis: Where Things Break

  1. Resource Starvation

    • IP exhaustion (e.g., subnet limits, ENI allocation issues)
    • CPU/Memory pressure causing evictions
    • Disk full errors from unbounded logs or images
  2. Network Misconfigurations

    • Subnet CIDR ranges too small for node/pod growth
    • Firewall rules blocking necessary ports (e.g., 6443, 2379-2380)
    • Incorrect DNS configurations breaking service discovery
  3. Node Instability

    • Tainted nodes not scheduling pods
    • Kernel panics or system container failures
    • Inadequate node autoscaling thresholds
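
Before jumping into repairs, a quick triage pass covers all three failure classes. The commands below are plain kubectl; kubectl top assumes metrics-server is installed:

      # Node readiness and pressure conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
      kubectl get nodes
      kubectl describe nodes | grep -A 6 Conditions

      # Live CPU and memory usage per node (requires metrics-server)
      kubectl top nodes

      # Recent cluster events, newest last: evictions, failed scheduling, image-pull errors
      kubectl get events -A --sort-by=.lastTimestamp | tail -n 30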

Repair Steps: What to Do When It Breaks

  1. Free Up Resources

    • Drain nodes with kubectl drain <node> --ignore-daemonsets --delete-emptydir-data (the flag that replaced --delete-local-data)
    • Scale down non-critical workloads to reclaim IPs/CPU
    • For IP exhaustion in EKS: disable warm ENI allocation in favor of small warm-IP targets on the VPC CNI (aws-node) DaemonSet; the values below are illustrative, so tune them to your pod density:
      kubectl set env daemonset/aws-node -n kube-system WARM_IP_TARGET=5 MINIMUM_IP_TARGET=10
      
  2. Expand Subnet or Migrate

    • Create a new subnet with a larger CIDR block (see the sketch after this list)
    • Update node group configurations to use new subnet
    • Open firewall rules for cross-subnet traffic
  3. Fix Node Issues

    • Reboot nodes via the cloud provider console, or with an OS-level reboot operator such as Kured (kubectl has no built-in reboot command)
    • Remove taints blocking scheduling (the trailing hyphen removes every taint with that key):
      kubectl taint nodes <node> node.kubernetes.io/not-ready-  
      
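For step 2, the subnet work itself happens at the cloud layer. A minimal AWS sketch, assuming an EKS cluster named prod and placeholder VPC/subnet IDs and CIDR (substitute your own; the cluster tag lets Kubernetes and the load balancer controllers discover the new subnet):

      # Carve a larger subnet out of unused VPC address space (values are illustrative)
      aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
        --cidr-block 10.0.64.0/22 --availability-zone us-east-1a

      # Tag it for cluster discovery, then point the node group or launch template at it
      aws ec2 create-tags --resources subnet-0abc1234def567890 \
        --tags Key=kubernetes.io/cluster/prod,Value=shared Key=kubernetes.io/role/internal-elb,Value=1
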

Prevention: Stopping Fires Before They Start

  1. Monitor and Alert

    • Track kubelet and node metrics (e.g., kubelet_running_pods, and kube_node_status_allocatable from kube-state-metrics)
    • Alert on node resource usage (CPU >80%, memory >90%); sample Prometheus rules follow this list
    • Watch for Evicted pod events:
      kubectl get events --field-selector reason=Evicted  
      
  2. Policy Example: Subnet Sizing

    • Enforce a minimum /22 CIDR (1,024 addresses) for worker subnets in AWS
    • Use OpenShift’s Cluster Network Operator to validate CIDR ranges
  3. Capacity Planning

    • Estimate how many nodes a subnet can support: (usable IPs - reserved IPs) / (pods per node + 1 for the node itself). A /24 in AWS has 251 usable IPs, so at ~30 pods per node it holds roughly 8 nodes
    • Schedule regular capacity reviews and keep a ~30% buffer for growth
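
On the alerting side, a sketch of Prometheus rules matching the thresholds above, assuming node_exporter metrics are scraped (alert names, durations, and severities are illustrative):

      groups:
        - name: node-capacity
          rules:
            - alert: NodeHighCPU
              expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
              for: 15m
              labels: {severity: warning}
              annotations: {summary: "Node CPU above 80% for 15 minutes"}
            - alert: NodeHighMemory
              expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
              for: 10m
              labels: {severity: critical}
              annotations: {summary: "Node memory above 90% for 10 minutes"}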

Tooling: What’s in My Kitbag

  • kubectl: kubectl describe node <node> for resource limits, kubectl logs -f <pod> for container issues
  • Cloud Tools: AWS VPC Flow Logs for network troubleshooting, GCP’s Cloud Console for node health
  • Observability: Prometheus + Grafana for metrics, Weave Scope for network visibility
  • Policy Enforcement: OPA Gatekeeper to block undersized subnets
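
For the VPC Flow Logs entry above, a minimal sketch of turning them on for a VPC (the IDs, log group name, and IAM role ARN are placeholders you create separately):

      # Capture rejected traffic VPC-wide into CloudWatch Logs for firewall and CIDR debugging
      aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-0123456789abcdef0 \
        --traffic-type REJECT --log-group-name k8s-vpc-flow-logs \
        --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role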

Tradeoffs and Caveats

  • Disabling Warm ENI: Frees IPs but increases pod startup latency by ~200-500ms. Use only when IP pressure is critical.
  • Over-Provisioning: Larger subnets increase IP waste; balance growth vs. cost.
  • Node Reboots: Mitigate downtime with rolling updates but expect 5-10 minutes per node.

Troubleshooting Common Failures

  1. IP Exhaustion

    • Check per-instance-type ENI and IP limits (MaximumNetworkInterfaces, Ipv4AddressesPerInterface): aws ec2 describe-instance-types --instance-types <type> --query "InstanceTypes[].NetworkInfo"
    • Verify subnet utilization (watch AvailableIpAddressCount): aws ec2 describe-subnets --filters "Name=tag:Name,Values=<subnet-name>" --query "Subnets[].AvailableIpAddressCount"
  2. Node Evictions

    • Check node conditions:
      kubectl describe node <node> | grep -A 5 Conditions  
      
    • Disk pressure shows up as DiskPressure: True in the conditions above; compare workload requests against what the node can offer: kubectl describe node <node> | grep -A 5 Allocatable
  3. DNS Failures

    • Test cluster DNS from a throwaway pod (the CoreDNS image ships without nslookup): kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default (a fuller check follows this list)
    • Check kube-proxy logs for IPVS errors:
      kubectl logs -n kube-system <kube-proxy-pod>  
      
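If the in-pod lookup fails, check CoreDNS itself next. These commands assume the standard k8s-app=kube-dns labels that most distributions use:

      # Are the CoreDNS replicas running, and are they logging SERVFAILs or upstream timeouts?
      kubectl get pods -n kube-system -l k8s-app=kube-dns
      kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

      # No endpoints behind the kube-dns Service means nothing is answering queries
      kubectl get svc/kube-dns endpoints/kube-dns -n kube-system

      # External names failing while cluster names resolve usually points at blocked forwarders
      kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup example.com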

When in doubt, start small: fix one node, validate, then scale. Document every workaround—because production amnesia kills teams.

Source thread: What Actually Goes Wrong in Kubernetes Production?
