Diagnosing Multi-service Failures in Production


JR


Diagnosing cross-service failures in Kubernetes typically takes 15-45 minutes with structured tooling and observability, but can take considerably longer without those practices.

Actionable Workflow

  1. Check monitoring dashboards for error spikes, latency, or downtime (Prometheus/Grafana, Datadog).
  2. Verify recent changes: kubectl rollout history deployment/<name> lists recent rollouts, while kubectl get deployments --show-labels and kubectl get configmaps show what is currently deployed.
  3. Inspect service health:
    • kubectl get endpoints --show-labels (check endpoint availability)
    • kubectl describe svc <service> (review events and IP assignments)
  4. Aggregate logs via a centralized platform (e.g., Loki, ELK) with queries like {app="auth"} |= "error" (Loki LogQL) or app:"auth" AND error:true (Elasticsearch query string).
  5. Trace requests using Jaeger/OpenTelemetry to identify latency or failure points.
  6. Test connectivity between services with kubectl exec into pods and curl/nc to affected endpoints.
  7. Escalate to specialized teams (network, DB, platform) if root cause remains unclear.
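
Steps 3 and 6 above can be scripted roughly as follows. This is a minimal sketch, not a drop-in tool: the function names (ready_endpoint_count, probe_from_pod, triage_service) are hypothetical, it assumes curl is present in the client pod's image, and it assumes the default cluster.local cluster domain.

```shell
#!/usr/bin/env bash
# Sketch of workflow steps 3 and 6 for a single Service.
# All function names and arguments are illustrative placeholders.

# Print how many ready endpoint addresses sit behind a Service.
ready_endpoint_count() {
  local ns="$1" svc="$2"
  kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' \
    | awk 'NF { n++ } END { print n + 0 }'
}

# Send one HTTP request from inside a client pod and print the status code.
# Assumption: the pod image ships curl. Exit status reflects connectivity,
# not application health (a 500 response still counts as "connected").
probe_from_pod() {
  local ns="$1" pod="$2" url="$3"
  kubectl exec -n "$ns" "$pod" -- \
    curl -sS -o /dev/null -m 5 -w '%{http_code}\n' "$url"
}

# Combine both checks and print a one-word verdict.
triage_service() {
  local ns="$1" svc="$2" pod="$3" port="$4" count
  count="$(ready_endpoint_count "$ns" "$svc")"
  if [ "$count" -eq 0 ]; then
    echo "no-ready-endpoints"
    return 1
  fi
  # Assumption: default cluster domain "cluster.local".
  if probe_from_pod "$ns" "$pod" "http://$svc.$ns.svc.cluster.local:$port/" >/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
    return 2
  fi
}
```

A call might look like triage_service <namespace> <service> <client-pod> <port>. A "no-ready-endpoints" verdict points at selectors or readiness probes, while "unreachable" with ready endpoints points at the network path or DNS.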

Policy Example: Dependency Check Protocol

### Dependency Check Runbook  
1. **Validate core dependencies first**:  
   - DNS: `dig +short <service-domain>` (external), `kubectl logs -n kube-system -l k8s-app=kube-dns` (CoreDNS; the label can vary by distribution)  
   - Service mesh: `kubectl get crds | grep istio.io` (Istio), check Envoy access logs (`kubectl logs <pod> -c istio-proxy`)  
   - Databases: `kubectl exec -it <db-pod> -- pg_isready` (Postgres), `kubectl exec -it <db-pod> -- mysqladmin ping` (MySQL)  
2. **Bypass internal resolvers** if DNS issues are suspected, by calling the Service IP directly:  
   - `kubectl exec -it <pod> -- curl -v http://<service-ip>:<port>`  
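
The "validate core dependencies first" ordering in the runbook could be automated along these lines. It is a sketch under stated assumptions: run_check, dns_resolves, and check_dependencies are hypothetical helper names, only the DNS and Postgres checks are shown, and the domain and pod name are placeholders to fill in.

```shell
#!/usr/bin/env bash
# Sketch: run dependency checks in runbook order, stop at the first failure.
# Helper names are illustrative, not part of any existing tool.

# Run one named check; print OK/FAIL and return non-zero on failure.
run_check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# dig exits 0 even when there is no answer, so test for non-empty output.
dns_resolves() {
  [ -n "$(dig +short "$1")" ]
}

# Check DNS first, then the database, mirroring the runbook order.
check_dependencies() {
  local domain="$1" db_pod="$2"
  run_check "dns" dns_resolves "$domain" || return 1
  run_check "postgres" kubectl exec "$db_pod" -- pg_isready || return 1
}
```

Invoked as check_dependencies <service-domain> <db-pod>, it prints one line per check and stops at the first FAIL, so a broken DNS layer is reported before anything that depends on it.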

Tooling

  • Metrics: Prometheus, Grafana (prebuilt dashboards for services, mesh, DBs)
  • Logs: Loki, ELK (structured logging with labels like app, environment)
  • Traces: Jaeger, OpenTelemetry (require instrumentation but reduce guesswork)
  • Network: tcpdump, netshoot (niche, for pod-level diagnostics)
  • Incident Mgmt: PagerDuty (alert routing), Jira (postmortems)

Tradeoffs and Caveats

  • Centralized logging/tracing adds complexity and cost; smaller teams may opt for lighter-weight tools.
  • Service mesh metrics can mask network layer issues (e.g., MTU mismatches, provider outages).
  • DNS problems often originate outside the cluster (provider outages, stale records), not CoreDNS itself.
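
One way to act on the DNS caveat is to compare an in-cluster lookup with an external one from the same pod. The sketch below is only a rough heuristic: classify_dns is a hypothetical name, the verdict strings are suggestions for where to look next rather than diagnoses, and it assumes nslookup exists in the pod image.

```shell
#!/usr/bin/env bash
# Sketch: separate in-cluster DNS trouble from upstream/provider trouble.
# Assumption: the target pod image includes nslookup (e.g. busybox-based).

classify_dns() {
  local ns="$1" pod="$2" external_name="$3"
  local internal=fail external=fail

  # kubernetes.default should resolve in any namespace via the search path.
  kubectl exec -n "$ns" "$pod" -- nslookup kubernetes.default \
    >/dev/null 2>&1 && internal=ok
  kubectl exec -n "$ns" "$pod" -- nslookup "$external_name" \
    >/dev/null 2>&1 && external=ok

  case "$internal/$external" in
    ok/ok)     echo "dns-healthy" ;;
    ok/fail)   echo "suspect-upstream-or-provider" ;;
    fail/ok)   echo "suspect-cluster-records-or-search-path" ;;
    fail/fail) echo "suspect-coredns-or-pod-network" ;;
  esac
}
```

If the internal lookup succeeds while the external one fails, CoreDNS itself is probably answering and the upstream resolver or provider deserves attention first.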

Troubleshooting Common Failures

  • DNS Resolution:
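    • Check CoreDNS logs: kubectl logs -n kube-system -l k8s-app=kube-dns (the label can vary by distribution)
    • Validate that the pod's /etc/resolv.conf matches the cluster config: the nameserver should be the cluster DNS Service IP, and the search domains should include the pod's namespace (such as <namespace>.svc.cluster.local).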
  • Service Mesh Issues:
    • Look for Envoy proxy errors in the sidecar logs, e.g. kubectl logs <pod> -c istio-proxy on Istio.
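
The resolv.conf check above can be sketched as a small script. The function names are hypothetical, and it assumes the cluster DNS Service is named kube-dns in kube-system, which is common but not universal. Pods that use a non-default dnsPolicy or a node-local DNS cache will legitimately show a different nameserver.

```shell
#!/usr/bin/env bash
# Sketch: confirm a pod's resolv.conf points at the cluster DNS Service.
# Function names are illustrative placeholders.

# Succeed if resolv.conf text on stdin lists the expected nameserver.
resolv_uses_nameserver() {
  local expected="$1"
  awk -v ip="$expected" \
    '$1 == "nameserver" && $2 == ip { found = 1 } END { exit !found }'
}

# Look up the cluster DNS Service IP and compare it with the pod's config.
# Assumption: the DNS Service is kube-system/kube-dns.
check_pod_resolv() {
  local ns="$1" pod="$2" dns_ip
  dns_ip="$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')"
  if kubectl exec -n "$ns" "$pod" -- cat /etc/resolv.conf \
      | resolv_uses_nameserver "$dns_ip"; then
    echo "resolv.conf points at cluster DNS ($dns_ip)"
  else
    echo "resolv.conf does not list $dns_ip"
    return 1
  fi
}
```

Running check_pod_resolv <namespace> <pod> on one healthy and one failing pod makes a mismatched nameserver easy to spot.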

Source thread: When a customer-facing workflow fails across 5+ services, how long does it actually take your team to figure out where it broke?
