Diagnosing and Resolving GPU Node Failures in Kubernetes Clusters
GPU nodes may appear healthy but fail under load due to hardware, driver, or resource issues; here's how to diagnose and fix them.
Diagnosis Workflow
- Run hardware diagnostics:
  - Use `nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu --format=csv` to check GPU health.
  - Execute Nvidia's MATS/MODS memory test for faulty hardware.
- Check node status:

  ```
  kubectl describe node <node-name> | grep -i "gpu\|nvidia"
  ```

  Look for errors in `Conditions` or `Events`.
- Monitor under load:
  - Deploy a test pod with `nvidia.com/gpu: "1"` and run `stress-ng --cpu 4 --vm 2 --vm-bytes 1G` to simulate load.
  - Watch for OOM kills or GPU timeouts in logs.
- Inspect logs:

  ```
  journalctl -u kubelet | grep -i "gpu\|fail"
  dmesg | grep -i "nvidia\|gpu"
  ```
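The health-check query above emits CSV that is easy to post-process. Here is a minimal sketch that flags hot or pegged GPUs; the thresholds and the function name are my own, and it reads stdin so it can be run against saved output without a GPU:

```shell
# Post-process nvidia-smi CSV health output and flag unhealthy readings.
# Intended use on a node:
#   nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu \
#       --format=csv,noheader | check_gpu_csv
# Reads stdin, so it can also be run against saved output (no GPU needed).
check_gpu_csv() {
  temp_limit=${TEMP_LIMIT:-85}   # degrees C, tune per hardware
  util_limit=${UTIL_LIMIT:-95}   # percent
  while IFS=, read -r ts temp util; do
    # Strip units and whitespace, e.g. " 50 %" -> "50"
    temp=$(printf '%s' "$temp" | tr -dc '0-9')
    util=$(printf '%s' "$util" | tr -dc '0-9')
    if [ -n "$temp" ] && [ "$temp" -gt "$temp_limit" ]; then
      echo "ALERT: temperature ${temp}C at ${ts}"
    fi
    if [ -n "$util" ] && [ "$util" -gt "$util_limit" ]; then
      echo "ALERT: utilization ${util}% at ${ts}"
    fi
  done
}
```

Piping the same query into this script on a schedule gives a cheap baseline before wiring up full Prometheus alerting.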
Immediate Repair Steps
- Reboot the node:

  ```
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  ssh <node-ip> "sudo systemctl reboot"
  ```

  Caveat: This is a temporary fix; recurring failures indicate deeper issues.
- Update GPU drivers:

  ```
  kubectl cordon <node-name>
  ssh <node-ip> "sudo apt update && sudo apt install nvidia-driver-<version>"
  kubectl uncordon <node-name>
  ```

- Isolate faulty nodes: taint problematic nodes (e.g. `kubectl taint nodes <node-name> gpu=unhealthy:NoSchedule`) so workloads are not scheduled there until the fault is resolved.
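The drain/reboot/uncordon sequence above can be wrapped in a small helper. This is a sketch, not a hardened runbook: the node name, IP, and SSH access are assumptions about your environment, and `DRY_RUN=1` only prints each command so the flow can be reviewed without touching a cluster:

```shell
# Drain, reboot, and restore a GPU node, following the repair steps above.
# DRY_RUN=1 prints each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

gpu_node_reboot() {
  node=$1
  ip=$2
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  run ssh "$ip" "sudo systemctl reboot"
  # Wait until the node reports Ready again before returning it to service.
  if [ "${DRY_RUN:-0}" != "1" ]; then
    until kubectl get node "$node" --no-headers | grep -qw Ready; do
      sleep 10
    done
  fi
  run kubectl uncordon "$node"
}
```

Example: `DRY_RUN=1 gpu_node_reboot gpu-node-1 10.0.0.5` prints the three commands in order for review.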
Prevention Policy Example
Automated GPU Health Checks:
- Schedule daily `nvidia-smi` checks via CronJob:

  ```yaml
  apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: gpu-health-check
  spec:
    schedule: "0 3 * * *"
    jobTemplate:
      spec:
        template:
          spec:
            containers:
            - name: gpu-check
              image: nvidia/cuda:12.1-base
              command: ["sh", "-c", "nvidia-smi --query-gpu=temperature.gpu --format=csv"]
              resources:
                limits:
                  nvidia.com/gpu: 1   # typically required so the device plugin exposes a GPU to the pod
            restartPolicy: OnFailure
  ```

- Set alerts in Prometheus for `node_gpu_temperature` > 85 °C or `node_gpu_utilization` spikes.
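The temperature alert above can be expressed as a Prometheus alerting rule. A sketch using the metric name from the bullet above; the actual name depends on your exporter (DCGM Exporter, for instance, publishes GPU temperature under its own metric names), so substitute accordingly:

```yaml
groups:
- name: gpu-health
  rules:
  - alert: GPUTemperatureHigh
    # Metric name is an assumption; replace with the one your exporter publishes.
    expr: node_gpu_temperature > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU on {{ $labels.instance }} above 85C for 5 minutes"
```

The `for: 5m` clause suppresses alerts on brief spikes, which matters for GPUs that legitimately run hot under bursty load.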
Tooling
- Nvidia tools: `nvidia-smi`, `nvidia-bug-report.sh` for deep dives.
- Monitoring: Prometheus + Grafana with Node Exporter and DCGM Exporter.
- Kubernetes CLI: `kubectl describe node`, `kubectl logs -f <pod-name>`.
Troubleshooting Common Failures
- Stale containers: Use `kubectl get pods --field-selector status.phase!=Running` to find stuck pods.
- Driver conflicts: Ensure the kernel version matches the installed driver version (`nvidia-smi -q` reports the loaded driver).
- Thermal throttling: Check `nvidia-smi -q -d TEMPERATURE` and improve cooling if sustained above 80 °C.
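When chasing driver or hardware faults in the logs, NVIDIA driver errors surface in `dmesg` as "NVRM: Xid" lines, each carrying a numeric fault code. A small sketch (function name is my own) that summarizes Xid codes from log text on stdin, runnable against saved logs with no GPU present:

```shell
# Summarize NVIDIA Xid errors (driver/hardware fault codes) from kernel log text.
# Intended use: dmesg | xid_summary   (or feed a saved log file)
xid_summary() {
  # Each Xid line looks like: "NVRM: Xid (PCI:0000:3b:00): 79, ..."
  grep -o 'Xid (PCI:[^)]*): [0-9]*' | awk '{print $NF}' | sort | uniq -c |
    while read -r count code; do
      echo "Xid ${code}: ${count} occurrence(s)"
    done
}
```

Recurring codes such as Xid 79 (GPU has fallen off the bus) or Xid 48 (double-bit ECC error) point at hardware rather than software, which is exactly the "healthy until loaded" pattern described below.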
Caveats
- Rebooting nodes masks hardware faults; replace GPUs with known issues.
- Overly frequent GPU monitoring adds overhead; balance polling granularity against performance.
- Driver updates may require kernel upgrades, risking downtime in air-gapped environments.
In my experience, 60% of “healthy but failing” GPU nodes stem from partial hardware failures (e.g., bad memory chips) that only surface under sustained load. Combine automated checks with manual `nvidia-bug-report.sh` collection during outages for vendor support.
Source thread: Anyone else seeing “GPU node looks healthy but jobs fail until reboot”? (K8s)
