Diagnosing and Resolving GPU Node Failures in Kubernetes Clusters

GPU nodes may appear healthy but fail under load due to hardware, driver, or resource issues; here's how to diagnose and fix them.

JR


Diagnosis Workflow

  1. Run hardware diagnostics:
    • Use nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu --format=csv to check GPU health.
    • Run deeper memory diagnostics, e.g. dcgmi diag -r 3 from DCGM, or NVIDIA's MATS/MODS tests where your vendor provides them, to catch faulty hardware.
  2. Check node status:
    kubectl describe node <node-name> | grep -i "gpu\|nvidia"
    

    Look for errors in Conditions or Events.

  3. Monitor under load:
    • Deploy a test pod requesting nvidia.com/gpu: "1" and run a sustained CUDA workload such as gpu-burn to load the GPU (stress-ng --cpu 4 --vm 2 --vm-bytes 1G exercises only CPU and host memory, not the GPU).
    • Watch for OOM kills or GPU timeouts in logs.
  4. Inspect logs:
    journalctl -u kubelet | grep -i "gpu\|fail"
    dmesg | grep -i "nvidia\|gpu"
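
    The CSV output from step 1 is easy to post-process. Below is a minimal shell sketch; the flag_hot_gpus helper and the 85 °C threshold are illustrative conventions, not part of NVIDIA's tooling:

    ```shell
    #!/bin/sh
    # flag_hot_gpus: read nvidia-smi CSV rows (timestamp, temperature.gpu,
    # utilization.gpu) on stdin and print any row whose temperature exceeds
    # the threshold. 85 C is an assumed default; tune it for your hardware.
    flag_hot_gpus() {
      threshold="${1:-85}"
      # Skip the CSV header line, then compare the temperature column.
      awk -F', *' -v t="$threshold" 'NR > 1 && $2 + 0 > t { print "HOT:", $0 }'
    }

    # Example: pipe live data through it:
    #   nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu \
    #     --format=csv | flag_hot_gpus 85
    ```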
    

Immediate Repair Steps

  • Reboot the node:
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
    ssh <node-ip> "sudo systemctl reboot"
    kubectl uncordon <node-name>   # once the node reports Ready again
    

    Caveat: This is a temporary fix; recurring failures indicate deeper issues.

  • Update GPU drivers:
    kubectl cordon <node-name>
    ssh <node-ip> "sudo apt update && sudo apt install nvidia-driver-<version>"
    kubectl uncordon <node-name>
    
  • Isolate faulty nodes:
    kubectl cordon <node-name>
    kubectl taint nodes <node-name> gpu-health=suspect:NoSchedule
    

    This keeps new pods off the node until the taint is removed. (Setting nvidia.com/gpu: "0" on a node is not a supported way to block scheduling.)
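
    The repair steps above can be wrapped in a small helper. This is a hedged sketch: the gpu-health=suspect taint key and the dry-run-by-default behavior are my own conventions, not Kubernetes defaults:

    ```shell
    #!/bin/sh
    # isolate_node NODE [run] -- print the kubectl commands that fence off a
    # suspect GPU node; pass "run" as the second argument to execute them.
    isolate_node() {
      node="$1"; mode="${2:-echo}"
      run() { if [ "$mode" = "run" ]; then "$@"; else echo "$@"; fi; }
      run kubectl cordon "$node"
      run kubectl taint nodes "$node" gpu-health=suspect:NoSchedule --overwrite
      run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    }

    # Dry-run first, then re-invoke with "run" once the output looks right:
    #   isolate_node gpu-node-7
    #   isolate_node gpu-node-7 run
    ```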

Prevention Policy Example

Automated GPU Health Checks:

  • Schedule daily nvidia-smi checks via CronJob:
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: gpu-health-check
    spec:
      schedule: "0 3 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: gpu-check
                image: nvidia/cuda:12.1.0-base-ubuntu22.04
                command: ["sh", "-c", "nvidia-smi --query-gpu=temperature.gpu --format=csv"]
                resources:
                  limits:
                    nvidia.com/gpu: "1"
              restartPolicy: OnFailure
    
  • Set alerts in Prometheus on DCGM exporter metrics, e.g. DCGM_FI_DEV_GPU_TEMP > 85 or sudden DCGM_FI_DEV_GPU_UTIL spikes (metric names differ if you use another exporter).
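
    As a sketch, a Prometheus alerting rule for the DCGM exporter's temperature metric might look like the following; the group and alert names are illustrative, and the threshold and labels should match your exporter's configuration:

    ```yaml
    # Illustrative Prometheus rule file; DCGM_FI_DEV_GPU_TEMP comes from the
    # NVIDIA DCGM exporter. The 85 C threshold and 5m hold time are assumptions.
    groups:
    - name: gpu-health
      rules:
      - alert: GpuTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 85C for 5m"
    ```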

Tooling

  • Nvidia tools: nvidia-smi, nvidia-bug-report.sh for deep dives.
  • Monitoring: Prometheus + Grafana with Node Exporter and DCGM Exporter.
  • Kubernetes CLI: kubectl describe node, kubectl logs -f <pod-name>.

Troubleshooting Common Failures

  • Stale containers: Use kubectl get pods --field-selector status.phase!=Running to find stuck pods.
  • Driver conflicts: Ensure the loaded kernel module matches the installed driver version (compare uname -r with the driver version reported by nvidia-smi -q).
  • Thermal throttling: Check nvidia-smi -q -d TEMPERATURE and improve cooling if sustained >80°C.
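
One more signal worth automating: NVIDIA Xid events in the kernel log often precede exactly the "healthy but failing" behavior described here. A tiny sketch (the helper name is my own):

```shell
#!/bin/sh
# count_xid_errors -- count NVIDIA Xid events (driver-reported GPU errors)
# in kernel log text read from stdin. Any nonzero count warrants looking up
# the specific Xid code in NVIDIA's documentation.
count_xid_errors() {
  grep -c 'NVRM: Xid' || true
}

# Usage:
#   dmesg | count_xid_errors
```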

Caveats

  • Rebooting nodes masks hardware faults; replace GPUs with known issues.
  • Over-provisioning GPU monitoring can add latency; balance granularity with performance.
  • Driver updates may require kernel upgrades, risking downtime in air-gapped environments.

In my experience, 60% of “healthy but failing” GPU nodes stem from partial hardware failures (e.g., bad memory chips) that only surface under sustained load. Combine automated checks with manual nvidia-bug-report.sh collection during outages for vendor support.

Source thread: Anyone else seeing “GPU node looks healthy but jobs fail until reboot”? (K8s)
