Pod Restarts: When to Alert and How to Handle Them

Pod restarts can indicate issues but aren’t always critical; alert only when they exceed thresholds or correlate with failures.

JR

Diagnosis First: Not All Restarts Are Equal

Start by distinguishing intentional restarts (e.g., deployment rollouts) from unplanned ones (OOM kills, application crashes). Use kubectl describe pod <pod-name> to check the Events section for Warning events such as OOMKilled or failed probes, and run kubectl logs --previous <pod-name> to inspect the logs from the container's previous run, before it crashed.

Actionable Workflow

  1. Check Restart Count:
    kubectl get pods --no-headers | awk '$4 > 0 {print $1, $4}'

    Lists pods whose restart count is greater than 0. In the default output the columns are NAME, READY, STATUS, RESTARTS, AGE, so the restart count is the fourth field, and comparing it numerically avoids false matches on pod names containing "0".

  2. Correlate with Metrics:
    Use Prometheus queries like increase(kube_pod_container_status_restarts_total[1h]) (from kube-state-metrics) to identify spikes.
  3. Investigate Root Cause:
    • OOM Kills: Check kubectl describe node <node> for memory pressure.
    • Health Check Failures: Validate readiness/liveness probes.
    • Image Issues: Confirm correct image version and pull permissions.
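The step-1 filter can be sanity-checked against canned output before pointing it at a live cluster. The pod names below are made up for illustration:

```shell
# Simulated `kubectl get pods --no-headers` output; columns are
# NAME READY STATUS RESTARTS AGE, so the restart count is field 4.
printf '%s\n' \
  'api-7f9c4   1/1  Running  3  2d' \
  'web-5b2d8   1/1  Running  0  2d' \
  'batch-x1z   0/1  Error    7  1h' |
awk '$4 > 0 {print $1, $4}'
# prints:
# api-7f9c4 3
# batch-x1z 7
```

On a real cluster, replace the printf with the kubectl command itself.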

Alerting Policy Example

Critical Services:

  • Alert if restarts > 5 in 5m and kube_pod_container_status_running == 0 for >2m.
  • Exclude deployments with restartCount increases during rolling updates.
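That compound condition can be sketched as a PromQL rule. The metric names assume kube-state-metrics is installed; the label matching may need adjusting for your setup:

```yaml
- alert: CriticalPodRestartLoop
  expr: |
    # more than 5 restarts in the last 5 minutes...
    increase(kube_pod_container_status_restarts_total[5m]) > 5
    # ...and no container in the pod is currently running
    and on (namespace, pod)
    max by (namespace, pod) (kube_pod_container_status_running) == 0
  for: 2m
  labels:
    severity: critical
```

The `for: 2m` clause implements the ">2m" requirement: the condition must hold continuously before the alert fires.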

Non-Critical Services:

  • Log restarts but alert only if restarts > 10 in 1h or correlated with error spikes.

Example Prometheus alert:

- alert: HighPodRestarts
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last hour"

Tooling

  • Metrics: Prometheus + Alertmanager for restart thresholds.
  • Logs: OpenShift’s EFK stack or Loki to correlate restarts with application errors.
  • Events: Use kubectl get events --field-selector involvedObject.kind=Pod to filter pod-related events.

Tradeoffs and Caveats

  • False Positives: Restart alerts during deployments or node maintenance will generate noise. Use exclusion rules.
  • Overhead: Aggressive alerting on all restarts can overwhelm teams; prioritize based on service criticality.
  • Assumption: Assumes restarts are logged and metrics are scraped reliably. Validate your monitoring pipeline.

Troubleshooting Common Failures

  • Image Pull Errors: Check kubectl describe pod for ImagePullBackOff. Verify image name and registry access.
  • OOM Kills: Use kubectl top nodes to identify memory-constrained nodes. Tune resource limits.
  • Health Check Flapping: Test probe endpoints locally; ensure the application responds to readiness probes correctly and that probe timeouts match real startup behavior.
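A quick way to spot pods stuck in the failure states above is to filter on the STATUS column. The sketch below runs the filter against canned output (hypothetical pod names) so the logic can be checked offline:

```shell
# Simulated `kubectl get pods --no-headers` lines; on a real cluster,
# pipe the actual command into the same awk filter.
printf '%s\n' \
  'worker-1  0/1  ImagePullBackOff  0  5m' \
  'cache-2   0/1  CrashLoopBackOff  6  1h' \
  'api-3     1/1  Running           0  2d' |
awk '$3 != "Running" {print $1, $3}'
# prints:
# worker-1 ImagePullBackOff
# cache-2 CrashLoopBackOff
```

Any pod this prints is a candidate for the kubectl describe / kubectl logs --previous triage described earlier.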

Final Note

Alert on restarts only when they indicate a failure chain. Most clusters see occasional restarts—focus on patterns, not individual events. In my experience, pairing restart metrics with error logs and resource usage gives the clearest signal for intervention.

Source thread: Do you alert on pod restarts, or is that just noise in most clusters?
