Pod Restarts: When to Alert and How to Handle Them
Pod restarts can indicate issues but aren’t always critical; alert only when they exceed thresholds or correlate with failures.
Diagnosis First: Not All Restarts Are Equal
Start by distinguishing between intentional restarts (e.g., deployment rollouts) and unplanned ones (OOM kills, app crashes). Use `kubectl describe pod <pod-name>` to check Events for causes like Error or Warning messages. Run `kubectl logs --previous <pod-name>` to inspect pre-crash logs.
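A minimal triage pass, assuming a hypothetical pod named `api-7d4b9`:

```bash
# Recent Events: look for OOMKilled, CrashLoopBackOff, or probe failures
kubectl describe pod api-7d4b9

# Logs from the previous container instance (i.e., from before the last crash)
kubectl logs --previous api-7d4b9
```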
Actionable Workflow
- Check Restart Count: list pods with a nonzero restart count (`RESTARTS` is the fourth column of `kubectl get pods`):

  ```bash
  kubectl get pods --no-headers | awk '$4 > 0 {print $1, $4}'
  ```
- Correlate with Metrics: use Prometheus queries like `kube_pod_container_status_restarts_total` to identify spikes.
- Investigate Root Cause:
  - OOM Kills: check `kubectl describe node <node>` for memory pressure (see the sketch after this list).
  - Health Check Failures: validate readiness/liveness probes.
  - Image Issues: confirm the correct image version and pull permissions.
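To confirm an OOM kill without scanning `describe` output, the termination reason is recorded in the pod status. A sketch, with `<pod-name>` as a placeholder:

```bash
# Prints the last termination reason per container; "OOMKilled" confirms an OOM kill
kubectl get pod <pod-name> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'
```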
Alerting Policy Example
Critical Services:
- Alert if `restarts > 5` in 5m and `kube_pod_container_status_running == 0` for >2m.
- Exclude deployments whose `restartCount` increases during rolling updates.
Non-Critical Services:
- Log restarts, but alert only if `restarts > 10` in 1h or restarts correlate with error spikes.
Example Prometheus alert:

```yaml
- alert: HighPodRestarts
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} has restarted {{ $value }} times in the last hour"
```
Tooling
- Metrics: Prometheus + Alertmanager for restart thresholds.
- Logs: OpenShift’s EFK stack or Loki to correlate restarts with application errors.
- Events: use `kubectl get events --field-selector involvedObject.kind=Pod` to filter pod-related events (selectors are combinable, as shown below).
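Combining field selectors cuts the noise down further; a small sketch:

```bash
# Only Warning-type pod events, most recent last
kubectl get events \
  --field-selector involvedObject.kind=Pod,type=Warning \
  --sort-by=.lastTimestamp
```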
Tradeoffs and Caveats
- False Positives: restart alerts fired during deployments or node maintenance are pure noise. Use exclusion rules.
- Overhead: Aggressive alerting on all restarts can overwhelm teams; prioritize based on service criticality.
- Assumption: this all presumes restarts are logged and metrics are scraped reliably. Validate your monitoring pipeline.
Troubleshooting Common Failures
- Image Pull Errors: check `kubectl describe pod` for `ImagePullBackOff`. Verify the image name and registry access.
- OOM Kills: use `kubectl top nodes` to identify memory-constrained nodes; tune resource limits.
- Health Check Flapping: test probes locally and ensure the application handles readiness probes correctly; see the sketch after this list.
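A quick manual probe test, assuming the probe hits `/healthz` on port 8080 (both hypothetical) and the image ships `wget`:

```bash
# Hit the probe endpoint from inside the container
kubectl exec <pod-name> -- wget -qO- --timeout=2 http://localhost:8080/healthz

# Compare against the configured probe settings
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}'
```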
Final Note
Alert on restarts only when they indicate a failure chain. Most clusters see occasional restarts—focus on patterns, not individual events. In my experience, pairing restart metrics with error logs and resource usage gives the clearest signal for intervention.
Source thread: Do you alert on pod restarts, or is that just noise in most clusters?
