Diagnosing and Resolving GPU Node Failures in Kubernetes Clusters

GPU nodes may appear healthy but fail under load due to hardware, driver, or resource issues; here's how to diagnose and fix them.

JR


Diagnosis Workflow

  1. Run hardware diagnostics:
    • Use nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu --format=csv to check GPU health.
    • Run deeper memory diagnostics, e.g. dcgmi diag -r 3 from DCGM, or NVIDIA's MATS/MODS tests where your vendor provides them, to catch faulty hardware.
  2. Check node status:
    kubectl describe node <node-name> | grep -i "gpu\|nvidia"
    

    Look for errors in Conditions or Events.

  3. Monitor under load:
    • Deploy a test pod requesting nvidia.com/gpu: "1" and run a sustained CUDA workload such as gpu-burn to load the GPU (stress-ng --cpu 4 --vm 2 --vm-bytes 1G exercises only CPU and host memory, not the GPU).
    • Watch for OOM kills or GPU timeouts in logs.
  4. Inspect logs:
    journalctl -u kubelet | grep -i "gpu\|fail"
    dmesg | grep -i "nvidia\|gpu"
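
    The CSV output from step 1 is easy to post-process. Below is a minimal shell sketch; the flag_hot_gpus helper and the 85 °C threshold are illustrative conventions, not part of NVIDIA's tooling:

    ```shell
    #!/bin/sh
    # flag_hot_gpus: read nvidia-smi CSV rows (timestamp, temperature.gpu,
    # utilization.gpu) on stdin and print any row whose temperature exceeds
    # the threshold. 85 C is an assumed default; tune it for your hardware.
    flag_hot_gpus() {
      threshold="${1:-85}"
      # Skip the CSV header line, then compare the temperature column.
      awk -F', *' -v t="$threshold" 'NR > 1 && $2 + 0 > t { print "HOT:", $0 }'
    }

    # Example: pipe live data through it:
    #   nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu \
    #     --format=csv | flag_hot_gpus 85
    ```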
    

Immediate Repair Steps

  • Reboot the node:
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
    ssh <node-ip> "sudo systemctl reboot"
    kubectl uncordon <node-name>   # once the node reports Ready again
    

    Caveat: This is a temporary fix; recurring failures indicate deeper issues.

  • Update GPU drivers:
    kubectl cordon <node-name>
    ssh <node-ip> "sudo apt update && sudo apt install nvidia-driver-<version>"
    kubectl uncordon <node-name>
    
  • Isolate faulty nodes:
    kubectl cordon <node-name>
    kubectl taint nodes <node-name> gpu-health=suspect:NoSchedule
    

    This keeps new pods off the node until the taint is removed. (Setting nvidia.com/gpu: "0" on a node is not a supported way to block scheduling.)
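
    The repair steps above can be wrapped in a small helper. This is a hedged sketch: the gpu-health=suspect taint key and the dry-run-by-default behavior are my own conventions, not Kubernetes defaults:

    ```shell
    #!/bin/sh
    # isolate_node NODE [run] -- print the kubectl commands that fence off a
    # suspect GPU node; pass "run" as the second argument to execute them.
    isolate_node() {
      node="$1"; mode="${2:-echo}"
      run() { if [ "$mode" = "run" ]; then "$@"; else echo "$@"; fi; }
      run kubectl cordon "$node"
      run kubectl taint nodes "$node" gpu-health=suspect:NoSchedule --overwrite
      run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    }

    # Dry-run first, then re-invoke with "run" once the output looks right:
    #   isolate_node gpu-node-7
    #   isolate_node gpu-node-7 run
    ```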

Prevention Policy Example

Automated GPU Health Checks:

  • Schedule daily nvidia-smi checks via CronJob:
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: gpu-health-check
    spec:
      schedule: "0 3 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: gpu-check
                image: nvidia/cuda:12.1.0-base-ubuntu22.04
                command: ["sh", "-c", "nvidia-smi --query-gpu=temperature.gpu --format=csv"]
                resources:
                  limits:
                    nvidia.com/gpu: "1"
              restartPolicy: OnFailure
    
  • Set alerts in Prometheus on DCGM exporter metrics, e.g. DCGM_FI_DEV_GPU_TEMP > 85 or sudden DCGM_FI_DEV_GPU_UTIL spikes (metric names differ if you use another exporter).
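
    As a sketch, a Prometheus alerting rule for the DCGM exporter's temperature metric might look like the following; the group and alert names are illustrative, and the threshold and labels should match your exporter's configuration:

    ```yaml
    # Illustrative Prometheus rule file; DCGM_FI_DEV_GPU_TEMP comes from the
    # NVIDIA DCGM exporter. The 85 C threshold and 5m hold time are assumptions.
    groups:
    - name: gpu-health
      rules:
      - alert: GpuTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 85C for 5m"
    ```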

Tooling

  • Nvidia tools: nvidia-smi, nvidia-bug-report.sh for deep dives.
  • Monitoring: Prometheus + Grafana with Node Exporter and DCGM Exporter.
  • Kubernetes CLI: kubectl describe node, kubectl logs -f <pod-name>.

Troubleshooting Common Failures

  • Stale containers: Use kubectl get pods --field-selector status.phase!=Running to find stuck pods.
  • Driver conflicts: Ensure the loaded kernel module matches the installed driver version (compare uname -r with the driver version reported by nvidia-smi -q).
  • Thermal throttling: Check nvidia-smi -q -d TEMPERATURE and improve cooling if sustained >80°C.
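
One more signal worth automating: NVIDIA Xid events in the kernel log often precede exactly the "healthy but failing" behavior described here. A tiny sketch (the helper name is my own):

```shell
#!/bin/sh
# count_xid_errors -- count NVIDIA Xid events (driver-reported GPU errors)
# in kernel log text read from stdin. Any nonzero count warrants looking up
# the specific Xid code in NVIDIA's documentation.
count_xid_errors() {
  grep -c 'NVRM: Xid' || true
}

# Usage:
#   dmesg | count_xid_errors
```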

Caveats

  • Rebooting nodes masks hardware faults; replace GPUs with known issues.
  • Over-provisioning GPU monitoring can add latency; balance granularity with performance.
  • Driver updates may require kernel upgrades, risking downtime in air-gapped environments.

In my experience, 60% of “healthy but failing” GPU nodes stem from partial hardware failures (e.g., bad memory chips) that only surface under sustained load. Combine automated checks with manual nvidia-bug-report.sh collection during outages for vendor support.

Source thread: Anyone else seeing “GPU node looks healthy but jobs fail until reboot”? (K8s)
