Network Problems in Production Kubernetes
Network issues are common in production Kubernetes, often stemming from CoreDNS, NetworkPolicy misconfigurations, or MTU mismatches, requiring systematic diagnosis and proactive policies.
Common Failure Modes
Network problems consistently rank as the second most frequent incident type after resource constraints. Key culprits include:
- CoreDNS health: Pod crashes, DNS cache exhaustion, or misconfigured forwarders.
- NetworkPolicy gaps: Overly permissive or restrictive rules blocking legitimate traffic.
- Endpoint population failures: Service selectors mismatching pod labels, leaving endpoints empty.
- MTU mismatches: Overlay networks (e.g., Flannel, Calico) with incorrect MTU settings causing packet fragmentation.
- kube-proxy or CNI bugs: Misconfigured iptables rules or broken network plugin integrations.
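Since CoreDNS tops this list, a quick health check is a sensible first move. A sketch, assuming a standard deployment where CoreDNS pods carry the `k8s-app=kube-dns` label:

```bash
# Confirm CoreDNS pods are Running and not restart-looping.
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Scan recent CoreDNS logs for SERVFAIL or upstream forwarder errors.
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
```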
Actionable Workflow
- Isolate the layer (a consolidated command sketch follows this list):
  - Check pod-to-pod connectivity with `ping` and `tcpdump` in a debug container.
  - Validate DNS resolution: `kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup <service-name>`.
  - Inspect service endpoints: `kubectl get endpoints <service-name>`.
- Audit NetworkPolicy:
  - List active policies: `kubectl get networkpolicies --all-namespaces`.
  - Test traffic flows with `netshoot` or `goldpinger`.
- Check MTU settings:
  - Compare node and pod interface MTU: `ip link show` (on nodes) vs. `kubectl exec <pod> -- ip link show`.
  - Look for ICMP "fragmentation needed" errors in logs.
- Validate kube-proxy:
  - Ensure its mode matches the CNI (e.g., iptables vs. IPVS).
  - Check node-level iptables rules: `iptables-save | grep <service-port>`.
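A consolidated sketch of these checks, assuming a kubeadm-style cluster; `my-svc`, `default`, and `<pod>` are placeholders:

```bash
# Spin up a throwaway pod and test DNS resolution for a service.
kubectl run -it --rm dns-debug --image=busybox --restart=Never -- \
  nslookup my-svc.default.svc.cluster.local

# Confirm the service actually has backing endpoints.
kubectl get endpoints my-svc -n default

# Compare MTU on the node and inside a pod; they should agree,
# minus any overlay encapsulation overhead (e.g., 50 bytes for VXLAN).
ip link show eth0                                  # run on the node
kubectl exec -n default <pod> -- ip link show eth0

# On kubeadm-style clusters, kube-proxy's mode lives in a ConfigMap.
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"
```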
Policy Example: NetworkPolicy for Default Deny
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
Tradeoff: Default-deny policies improve security but require explicit allow rules for all traffic, increasing operational overhead during rollouts.
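Note that a default-deny with `Egress` also blocks in-cluster DNS, so an explicit DNS allowance is typically the first companion rule. A minimal sketch, assuming Kubernetes 1.21+ (for the automatic `kubernetes.io/metadata.name` namespace label) and a CoreDNS deployment labeled `k8s-app: kube-dns`:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}        # all pods in this namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```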
Tooling
- netshoot: Swiss Army knife for network debugging (`tcpdump`, `curl`, `nslookup`, etc.). Example: `kubectl run netshoot --image=nicolaka/netshoot --rm -it -- /bin/bash`.
- Hubble: Real-time network flow visibility (Cilium-based).
- tcpdump: For packet-level inspection in affected pods or nodes.
- goldpinger: Lightweight pod-to-pod connectivity tester.
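For captures without touching the target pod's image, `kubectl debug` (ephemeral containers, stable since Kubernetes 1.25) pairs well with netshoot. A sketch, with `<pod>` and `<node>` as placeholders:

```bash
# Attach an ephemeral netshoot container to a running pod; it shares
# the pod's network namespace, so tcpdump sees the pod's traffic
# (the interface name, eth0 here, depends on the CNI).
kubectl debug -it <pod> --image=nicolaka/netshoot -- tcpdump -i eth0 port 53

# Or open a shell in the node's host network namespace.
kubectl debug node/<node> -it --image=nicolaka/netshoot -- bash
```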
Troubleshooting Common Pitfalls
- Empty endpoints: Verify pod labels match service selectors. Check event logs for admission webhook errors.
- DNS failures: Ensure CoreDNS pods are running and the `Corefile` configuration is correct. Test with `dig` or `nslookup`.
- MTU issues: Symptoms include intermittent connectivity or TCP retransmissions. Fix by aligning MTU across nodes and pods.
- kube-proxy misconfigurations: Mismatched modes (e.g., iptables on a node with IPVS configured) break proxying.
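Most of these pitfalls can be ruled out quickly from the command line; a sketch, with `my-svc` as a placeholder and assuming the standard `coredns` ConfigMap name used by kubeadm:

```bash
# Empty endpoints usually mean a label/selector mismatch:
# compare the service's selector against actual pod labels.
kubectl get svc my-svc -o jsonpath='{.spec.selector}'; echo
kubectl get pods --show-labels

# Inspect the CoreDNS Corefile for misconfigured forwarders.
kubectl -n kube-system get configmap coredns -o yaml
```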
Network problems are inevitable in dynamic environments, but a structured approach—combined with proactive policies and the right tools—reduces mean time to repair (MTTR) significantly. Prioritize observability and least-privilege NetworkPolicy rules to balance security and usability.
Source thread: How common are network problems in a real production env? New here.
