Network Problems in Production Kubernetes
Network issues are common in production Kubernetes, often stemming from CoreDNS, NetworkPolicy misconfigurations, or MTU mismatches, requiring systematic diagnosis and proactive policies.
Common Failure Modes
Network problems consistently rank as the second most frequent incident type after resource constraints. Key culprits include:
- CoreDNS health: Pod crashes, DNS cache exhaustion, or misconfigured forwarders.
- NetworkPolicy gaps: Overly permissive or restrictive rules blocking legitimate traffic.
- Endpoint population failures: Service selectors mismatching pod labels, leaving endpoints empty.
- MTU mismatches: Overlay networks (e.g., Flannel, Calico) with incorrect MTU settings causing packet fragmentation.
- kube-proxy or CNI bugs: Misconfigured iptables rules or broken network plugin integrations.
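Since CoreDNS tops this list, a quick health check is a sensible first move. A sketch, assuming a standard deployment where CoreDNS pods carry the `k8s-app=kube-dns` label:

```bash
# Confirm CoreDNS pods are Running and not restart-looping.
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Scan recent CoreDNS logs for SERVFAIL or upstream forwarder errors.
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
```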
Actionable Workflow
- Isolate the layer (a consolidated command sketch follows this list):
  - Check pod-to-pod connectivity with `ping` and `tcpdump` in a debug container.
  - Validate DNS resolution: `kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup <service-name>`.
  - Inspect service endpoints: `kubectl get endpoints <service-name>`.
- Audit NetworkPolicy:
  - List active policies: `kubectl get networkpolicies --all-namespaces`.
  - Test traffic flows with `netshoot` or `goldpinger`.
- Check MTU settings:
  - Compare node and pod interface MTU: `ip link show` (on nodes) vs. `kubectl exec <pod> -- ip link show`.
  - Look for ICMP "fragmentation needed" errors in logs.
- Validate kube-proxy:
  - Ensure its mode matches the CNI (e.g., iptables vs. IPVS).
  - Check node-level iptables rules: `iptables-save | grep <service-port>`.
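A consolidated sketch of these checks, assuming a kubeadm-style cluster; `my-svc`, `default`, and `<pod>` are placeholders:

```bash
# Spin up a throwaway pod and test DNS resolution for a service.
kubectl run -it --rm dns-debug --image=busybox --restart=Never -- \
  nslookup my-svc.default.svc.cluster.local

# Confirm the service actually has backing endpoints.
kubectl get endpoints my-svc -n default

# Compare MTU on the node and inside a pod; they should agree,
# minus any overlay encapsulation overhead (e.g., 50 bytes for VXLAN).
ip link show eth0                                  # run on the node
kubectl exec -n default <pod> -- ip link show eth0

# On kubeadm-style clusters, kube-proxy's mode lives in a ConfigMap.
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"
```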
Policy Example: NetworkPolicy for Default Deny
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
Tradeoff: Default-deny policies improve security but require explicit allow rules for all traffic, increasing operational overhead during rollouts.
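Note that a default-deny with `Egress` also blocks in-cluster DNS, so an explicit DNS allowance is typically the first companion rule. A minimal sketch, assuming Kubernetes 1.21+ (for the automatic `kubernetes.io/metadata.name` namespace label) and a CoreDNS deployment labeled `k8s-app: kube-dns`:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}        # all pods in this namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```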
Tooling
- netshoot: Swiss Army knife for network debugging (`tcpdump`, `curl`, `nslookup`, etc.). Example: `kubectl run netshoot --image=nicolaka/netshoot --rm -it -- /bin/bash`.
- Hubble: Real-time network flow visibility (Cilium-based).
- tcpdump: For packet-level inspection in affected pods or nodes.
- goldpinger: Lightweight pod-to-pod connectivity tester.
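For captures without touching the target pod's image, `kubectl debug` (ephemeral containers, stable since Kubernetes 1.25) pairs well with netshoot. A sketch, with `<pod>` and `<node>` as placeholders:

```bash
# Attach an ephemeral netshoot container to a running pod; it shares
# the pod's network namespace, so tcpdump sees the pod's traffic
# (the interface name, eth0 here, depends on the CNI).
kubectl debug -it <pod> --image=nicolaka/netshoot -- tcpdump -i eth0 port 53

# Or open a shell in the node's host network namespace.
kubectl debug node/<node> -it --image=nicolaka/netshoot -- bash
```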
Troubleshooting Common Pitfalls
- Empty endpoints: Verify pod labels match service selectors. Check event logs for admission webhook errors.
- DNS failures: Ensure CoreDNS pods are running and the `Corefile` configuration is correct. Test with `dig` or `nslookup`.
- MTU issues: Symptoms include intermittent connectivity or TCP retransmissions. Fix by aligning MTU across nodes and pods.
- kube-proxy misconfigurations: Mismatched modes (e.g., iptables on a node with IPVS configured) break proxying.
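Most of these pitfalls can be ruled out quickly from the command line; a sketch, with `my-svc` as a placeholder and assuming the standard `coredns` ConfigMap name used by kubeadm:

```bash
# Empty endpoints usually mean a label/selector mismatch:
# compare the service's selector against actual pod labels.
kubectl get svc my-svc -o jsonpath='{.spec.selector}'; echo
kubectl get pods --show-labels

# Inspect the CoreDNS Corefile for misconfigured forwarders.
kubectl -n kube-system get configmap coredns -o yaml
```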
Network problems are inevitable in dynamic environments, but a structured approach—combined with proactive policies and the right tools—reduces mean time to repair (MTTR) significantly. Prioritize observability and least-privilege NetworkPolicy rules to balance security and usability.
Source thread: How common are network problems in a real production env? New here.
