Diagnosing Multi-service Failures in Production
Diagnosing cross-service failures in Kubernetes typically takes 15-45 minutes with structured tooling and observability, but can take considerably longer when those practices are missing.
Actionable Workflow
- Check monitoring dashboards for error spikes, latency, or downtime (Prometheus/Grafana, Datadog).
- Verify recent changes using `kubectl get deployments --show-labels` and `kubectl get configmaps`.
- Inspect service health:
  - `kubectl get endpoints --show-labels` (check endpoint availability)
  - `kubectl describe svc <service>` (review events and IP assignments)
- Aggregate logs via centralized platform (e.g., Loki, ELK) with queries like `{app="auth"} AND error=true`.
- Trace requests using Jaeger/OpenTelemetry to identify latency or failure points.
- Test connectivity between services with `kubectl exec` into pods and `curl`/`nc` to affected endpoints.
- Escalate to specialized teams (network, DB, platform) if root cause remains unclear.
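The "inspect service health" step can be partly automated. The following is a minimal sketch, not a definitive tool: it assumes `kubectl` is configured for the affected cluster, and the function names `services_without_endpoints` and `check_namespace_endpoints` are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list Services that currently have no ready endpoints.

# Reads rows in the shape of `kubectl get endpoints --no-headers`
# (NAME ENDPOINTS AGE ...) on stdin and prints the NAME of every row
# whose ENDPOINTS column is "<none>".
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Hypothetical wrapper: the namespace argument defaults to "default".
# Requires cluster access, so it is only defined here, not executed.
check_namespace_endpoints() {
  local namespace="${1:-default}"
  kubectl get endpoints -n "$namespace" --no-headers | services_without_endpoints
}
```

Calling `check_namespace_endpoints <namespace>` prints one Service name per line. An empty result suggests every Service has at least one backing pod, so the failure probably lies elsewhere in the workflow.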
Policy Example: Dependency Check Protocol
### Dependency Check Runbook
1. **Validate core dependencies first**:
   - DNS: `dig +short <service-domain>` (external), `kubectl logs -n kube-system -l k8s-app=kube-dns` (the usual CoreDNS pod label and namespace; adjust if your distribution differs)
   - Service mesh: `kubectl get crds | grep mesh`, check Envoy access logs (`kubectl logs <pod> -c istio-proxy`, since the Istio sidecar is a container inside each application pod)
   - Databases: `kubectl exec -it <db-pod> -- pg_isready` (Postgres), `kubectl exec -it <db-pod> -- mysqladmin ping` (MySQL)
2. **Bypass internal resolvers** if DNS issues suspected:
   - `kubectl exec -it <pod> -- curl -v http://<service-ip>:<port>`
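One way to make the runbook repeatable is to wrap each check in a small pass/fail helper. This is a sketch under stated assumptions: `run_check`, `resolves`, and `dependency_report` are hypothetical names, the service domain and database pod are placeholders you supply, and only the DNS and Postgres checks from the runbook are shown.

```shell
#!/usr/bin/env bash
# Sketch: run dependency checks in a fixed order and report each result.

# run_check NAME COMMAND [ARGS...]
# Runs COMMAND with its output suppressed, prints "PASS NAME" or
# "FAIL NAME", and returns non-zero when the command fails.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# dig exits 0 even when no record is found, so test for non-empty output.
resolves() {
  [ -n "$(dig +short "$1" 2>/dev/null)" ]
}

# Hypothetical entry point: dependency_report <service-domain> <db-pod>
# Requires dig and cluster access, so it is only defined here.
dependency_report() {
  local service_domain="$1" db_pod="$2" failed=0
  run_check "dns" resolves "$service_domain" || failed=1
  run_check "postgres" kubectl exec "$db_pod" -- pg_isready || failed=1
  return "$failed"
}
```

The function keeps running after a failure, so a single run shows every broken dependency rather than only the first.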
Tooling
- Metrics: Prometheus, Grafana (prebuilt dashboards for services, mesh, DBs)
- Logs: Loki, ELK (structured logging with labels like `app`, `environment`)
- Traces: Jaeger, OpenTelemetry (require instrumentation but reduce guesswork)
- Network: `tcpdump`, `netshoot`, `niche` for pod-level diagnostics
- Incident Mgmt: PagerDuty (alert routing), Jira (postmortems)
Tradeoffs and Caveats
- Centralized logging/tracing adds complexity and cost; smaller teams may opt for lighter-weight tools.
- Service mesh metrics can mask network layer issues (e.g., MTU mismatches, provider outages).
- DNS problems often originate outside the cluster (provider outages, stale records), not CoreDNS itself.
Troubleshooting Common Failures
- DNS Resolution:
  - Check `coredns` logs: `kubectl logs -n kube-system -l k8s-app=kube-dns` (the usual CoreDNS pod label and namespace; adjust if your distribution differs)
  - Validate pod `/etc/resolv.conf` matches cluster config.
- Service Mesh Issues:
  - Look for Envoy proxy errors
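The `/etc/resolv.conf` check under DNS Resolution can be scripted roughly as follows. This is a sketch, not a definitive check: `resolv_conf_uses` and `pod_dns_matches_cluster` are hypothetical names. It also assumes the cluster DNS Service is `kube-dns` in `kube-system` and that pods point at its ClusterIP directly. Setups using NodeLocal DNSCache or a custom `dnsPolicy` will differ.

```shell
#!/usr/bin/env bash
# Sketch: compare a pod's resolv.conf nameservers with an expected DNS IP.

# Reads resolv.conf content on stdin and succeeds only if some
# "nameserver" line carries the expected IP passed as $1.
resolv_conf_uses() {
  local expected="$1"
  awk -v ip="$expected" '
    $1 == "nameserver" && $2 == ip { found = 1 }
    END { exit !found }
  '
}

# Hypothetical wrapper: pod_dns_matches_cluster <pod>
# Assumption: the DNS Service is named kube-dns in kube-system.
# Requires cluster access, so it is only defined here.
pod_dns_matches_cluster() {
  local pod="$1" cluster_dns
  cluster_dns="$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')"
  kubectl exec "$pod" -- cat /etc/resolv.conf | resolv_conf_uses "$cluster_dns"
}
```

A non-zero exit from `pod_dns_matches_cluster` means the pod is not resolving through the expected cluster DNS IP. That is worth ruling out before digging into CoreDNS logs.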
Source thread: When a customer-facing workflow fails across 5+ services, how long does it actually take your team to figure out where it broke?
