Diagnosing Multi-service Failures in Production

May 29, 2026 JR

With structured tooling and observability in place, diagnosing a cross-service failure in Kubernetes typically takes 15-45 minutes. Without them, the investigation can drag on considerably longer.

Actionable Workflow

1. Check monitoring dashboards (Prometheus/Grafana, Datadog) for error spikes, elevated latency, or downtime.
2. Verify recent changes: `kubectl rollout history deployment/<name>` lists revisions, while `kubectl get deployments --show-labels` and `kubectl get configmaps` show what is currently deployed.
3. Inspect service health:
   - `kubectl get endpoints --show-labels` (check endpoint availability)
   - `kubectl describe svc <service>` (review events and IP assignments)
4. Aggregate logs in a centralized platform (e.g., Loki, ELK) and filter by service, for example with the LogQL query `{app="auth"} |= "error"`.
5. Trace requests with Jaeger/OpenTelemetry to pinpoint where latency or failures are introduced.
6. Test connectivity between services: `kubectl exec` into a pod and run `curl` or `nc` against the affected endpoints.
7. Escalate to specialized teams (network, DB, platform) if the root cause remains unclear.
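The connectivity test in the workflow above can be wrapped in a small helper so the result is easy to compare across pods. This is a minimal sketch, not a standard tool: the function names `probe_from_pod` and `classify_status` are hypothetical, and it assumes `curl` is available inside the pod's image.

```shell
#!/usr/bin/env bash
# Sketch: probe a service endpoint from inside a pod.
# Function names are illustrative; namespace, pod, and URL are placeholders.

# probe_from_pod <namespace> <pod> <url>
# Prints the HTTP status code reported by curl inside the pod.
# curl prints 000 when it could not connect at all.
probe_from_pod() {
  local ns="$1" pod="$2" url="$3"
  kubectl exec -n "$ns" "$pod" -- \
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true
}

# classify_status <http_code>
# Maps a status code to a coarse verdict for triage notes.
classify_status() {
  case "$1" in
    2??|3??) echo "reachable" ;;
    4??|5??) echo "responding-with-errors" ;;
    *)       echo "unreachable" ;;  # 000, empty, or anything unexpected
  esac
}
```

A call such as `classify_status "$(probe_from_pod <namespace> <pod> http://<service>:<port>/)"` separates "the network path is broken" from "the service answers but returns errors", which usually decides which team to involve next.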

Policy Example: Dependency Check Protocol

### Dependency Check Runbook  
1. **Validate core dependencies first**:  
   - DNS: `dig +short <service-domain>` (external), `kubectl logs -n kube-system -l k8s-app=kube-dns` (CoreDNS pods; the label can differ by distribution)  
   - Service mesh: `kubectl get crds | grep istio.io`, check Envoy access logs (`kubectl logs <pod> -c istio-proxy`)  
   - Databases: `kubectl exec -it <db-pod> -- pg_isready` (Postgres), `kubectl exec -it <db-pod> -- mysqladmin ping` (MySQL)  
2. **Bypass internal resolvers** if DNS issues are suspected:  
   - `kubectl exec -it <pod> -- curl -v http://<service-ip>:<port>`
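The runbook's checks can be strung together so each dependency reports a one-line verdict. The sketch below is one possible shape under stated assumptions: `check_dep`, `dns_resolves`, and `run_dependency_checks` are hypothetical names, the Postgres pod is assumed to ship `pg_isready`, and `dig` is assumed to be installed where the script runs.

```shell
#!/usr/bin/env bash
# Sketch of a dependency check runner. Function names and arguments are
# illustrative; they are not part of the runbook itself.

# check_dep <name> <command...>
# Runs the command quietly and prints "<name>: OK" or "<name>: FAIL".
check_dep() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: OK"
  else
    echo "$name: FAIL"
    return 1
  fi
}

# dns_resolves <domain>
# dig can exit 0 even when no records come back, so test the output instead.
dns_resolves() {
  [ -n "$(dig +short "$1")" ]
}

# run_dependency_checks <service-domain> <db-pod>
# Reports every dependency, then returns non-zero if any check failed.
run_dependency_checks() {
  local domain="$1" db_pod="$2" failed=0
  check_dep "dns" dns_resolves "$domain" || failed=1
  # No -it here: a TTY is not needed in a non-interactive script.
  check_dep "postgres" kubectl exec "$db_pod" -- pg_isready || failed=1
  return "$failed"
}
```

Running every check instead of stopping at the first failure is a deliberate choice in this sketch: during an incident it helps to see whether DNS and the database are failing together or independently.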

Tooling

Metrics: Prometheus, Grafana (prebuilt dashboards for services, mesh, DBs)
Logs: Loki, ELK (structured logging with labels like app, environment)
Traces: Jaeger, OpenTelemetry (require instrumentation but reduce guesswork)
Network: tcpdump, netshoot (niche, but useful for pod-level diagnostics)
Incident Mgmt: PagerDuty (alert routing), Jira (postmortems)

Tradeoffs and Caveats

Centralized logging/tracing adds complexity and cost; smaller teams may opt for lighter-weight tools.
Service mesh metrics can mask network layer issues (e.g., MTU mismatches, provider outages).
DNS problems often originate outside the cluster (provider outages, stale records), not CoreDNS itself.

Troubleshooting Common Failures

DNS Resolution:
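- Check CoreDNS logs: kubectl logs -n kube-system -l k8s-app=kube-dns (the label can differ by distribution)
- Validate that the pod's /etc/resolv.conf points at the cluster DNS service and carries the expected search domains.
Service Mesh Issues:
- Look for Envoy proxy errors in the sidecar logs: kubectl logs <pod> -c istio-proxy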
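The resolv.conf check above can be partly automated by comparing the pod's first nameserver with the cluster DNS service IP. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the DNS service is called `kube-dns` in the `kube-system` namespace, which is common but not universal. A mismatch is not always a fault; setups such as NodeLocal DNSCache or pods with `dnsPolicy: Default` legitimately use a different resolver.

```shell
#!/usr/bin/env bash
# Sketch: compare a pod's resolver with the cluster DNS service IP.
# Helper names are illustrative; kube-dns/kube-system is an assumed default.

# first_nameserver: reads resolv.conf content on stdin and prints the
# IP from the first "nameserver" line.
first_nameserver() {
  awk '$1 == "nameserver" { print $2; exit }'
}

# resolver_matches_cluster <namespace> <pod>
# Prints "match: <ip>" or "mismatch: ..." and returns non-zero on mismatch.
resolver_matches_cluster() {
  local ns="$1" pod="$2" pod_dns cluster_ip
  pod_dns="$(kubectl exec -n "$ns" "$pod" -- cat /etc/resolv.conf | first_nameserver)"
  cluster_ip="$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')"
  if [ "$pod_dns" = "$cluster_ip" ]; then
    echo "match: $pod_dns"
  else
    echo "mismatch: pod=$pod_dns cluster=$cluster_ip"
    return 1
  fi
}
```

If the two values match and lookups still fail, the problem is more likely in CoreDNS itself or upstream of the cluster than in the pod's configuration.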

Source thread: When a customer-facing workflow fails across 5+ services, how long does it actually take your team to figure out where it broke?
