Is My GPU Healthy?

GPU panels appear only on GPU instances.

GPU Utilization

Compute occupancy of the GPU as a percentage. This is the share of sample windows in which at least one CUDA kernel was running on the device. It tells you the GPU was busy, but not necessarily that it was busy with useful work: a window counts as busy even if a kernel ran for only a small part of it. A model that is bottlenecked on host-side data loading can therefore show high utilization while throughput is low, so always read this panel together with your application-side throughput numbers.
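
To make the definition concrete, here is a minimal sketch that computes utilization as the share of busy sample windows. The function name and the sample data are hypothetical and are not part of any platform API; the sketch only illustrates why a device can read as fully utilized while doing very little work per window.

```python
def utilization_percent(busy_windows):
    """Share of sample windows with at least one kernel running, as a percentage.

    busy_windows: iterable of booleans, one per sample window (hypothetical input).
    """
    windows = list(busy_windows)
    if not windows:
        return 0.0
    return 100.0 * sum(windows) / len(windows)


# A short kernel in every window reads the same as a saturating kernel in
# every window: the metric records "busy", not "how busy".
short_kernel_every_window = [True] * 10
print(utilization_percent(short_kernel_every_window))  # 100.0

# Kernels ran in 3 of 10 windows.
mostly_idle = [True, False, False, True, False, False, False, True, False, False]
print(utilization_percent(mostly_idle))  # 30.0
```

This is why the number needs a second signal, such as samples per second from your own job, before you conclude the card is doing useful work.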

GPU Memory Used / GPU Memory Total

VRAM accounting. GPU Memory Used is the amount currently allocated by your processes on the device; GPU Memory Total is the physical capacity of the card.

VRAM exhaustion is the most common cause of CUDA out-of-memory crashes. If a job dies unexpectedly, check this panel before assuming a host issue — running close to the total is the warning sign.
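
As a rough illustration of "running close to the total", the sketch below turns one pair of readings from this panel into a used fraction and a warning flag. The function name and the 90% cutoff are assumptions made for the example, not platform defaults; pick a cutoff that matches how spiky your own allocations are.

```python
def vram_headroom(used_bytes, total_bytes, warn_fraction=0.90):
    """Return (fraction_used, is_warning) for one VRAM reading.

    warn_fraction is an assumed threshold, not a platform default.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction_used = used_bytes / total_bytes
    return fraction_used, fraction_used >= warn_fraction


GIB = 1024 ** 3
fraction, warning = vram_headroom(used_bytes=38 * GIB, total_bytes=40 * GIB)
print(f"{fraction:.0%} used, warning={warning}")  # 95% used, warning=True
```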

GPU Temperature

Die temperature of the GPU, in degrees Celsius. Modern data-center GPUs report a single junction temperature (NVIDIA) or edge temperature (AMD), which the device firmware uses to drive its own thermal-management decisions.

What to look for:

  • Sustained temperatures near the throttle threshold (typically the mid-80s to low-90s °C, vendor-dependent) — the firmware will down-clock the card to keep it safe, which shows up in your application as a sudden, unexplained performance drop while utilization stays high. If you see this with workloads that used to run cooler on the same hardware, the cooling path has degraded; contact support.

  • Sudden spikes during a previously stable workload: these usually indicate a fan event (see GPU Fan Speed below) or a hot neighbor on a shared chassis.

  • Temperatures consistently far below the throttle threshold — healthy. Nothing to do.

You cannot directly fix GPU temperature from inside the guest; if the card is running hot, the action is either to reduce the workload's duty cycle, or to escalate to support so the chassis can be inspected.
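
The difference between a brief spike and a sustained run near the throttle threshold can be sketched as a simple check over recent readings. Everything here is illustrative: the function name is hypothetical, the throttle threshold is vendor-dependent and must be supplied by you, and the margin and sample count are assumed values.

```python
def sustained_near_throttle(temps_c, throttle_c, margin_c=5.0, min_samples=5):
    """True if the most recent min_samples readings are all within margin_c
    of throttle_c, or above it.

    throttle_c is vendor-dependent; margin_c and min_samples are assumed
    values chosen for illustration.
    """
    recent = list(temps_c)[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough history to call it sustained
    return all(t >= throttle_c - margin_c for t in recent)


# A brief spike is not "sustained"; a plateau near the threshold is.
spike = [62, 63, 88, 64, 63, 62]
plateau = [70, 84, 85, 86, 85, 86]
print(sustained_near_throttle(spike, throttle_c=88))    # False
print(sustained_near_throttle(plateau, throttle_c=88))  # True
```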

GPU Fan Speed

Fan speed reported as a fraction between 0 and 1 (0.7 means the fan is running at 70% of its maximum). The card's firmware drives the fan automatically based on temperature, so this panel is most useful when read together with GPU Temperature:

  • Fan speed climbs in step with temperature — the cooling loop is working as intended. Nothing to do.

  • Fan pinned at maximum while temperature continues to climb — the cooling system has run out of headroom. The card will throttle next, and possibly hit GPU Reset Required after that. Open a ticket immediately.

  • Fan speed low while temperature is high — the firmware is not driving the fan, or the fan controller has failed. This is a hardware fault; open a ticket.

  • Fan speed at 0 on an idle card — normal for cards with passive idle behavior.
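
The four cases above amount to a small decision table over fan speed and temperature. The sketch below encodes it under stated assumptions: the function name, the labels it returns, and the cutoffs (hot_margin_c, low_fan, max_fan) are all hypothetical, and throttle_c is the vendor-dependent threshold for your card.

```python
def classify_cooling(fan_fraction, temp_c, throttle_c, temp_rising,
                     hot_margin_c=5.0, low_fan=0.3, max_fan=0.99):
    """Map one fan/temperature reading to the cases in the list above.

    fan_fraction: fan speed as a fraction between 0 and 1, as in the panel.
    temp_rising: True if temperature is still climbing across recent readings.
    hot_margin_c, low_fan, max_fan: assumed cutoffs for illustration only.
    """
    hot = temp_c >= throttle_c - hot_margin_c
    if fan_fraction >= max_fan and temp_rising:
        return "no_headroom"  # fan pinned at maximum, temperature still climbing
    if hot and fan_fraction <= low_fan:
        return "fan_fault"    # fan not responding to a hot card
    return "ok"               # fan tracking temperature, or idle card with fan at 0


print(classify_cooling(1.0, 86, 88, temp_rising=True))    # no_headroom
print(classify_cooling(0.1, 86, 88, temp_rising=False))   # fan_fault
print(classify_cooling(0.0, 35, 88, temp_rising=False))   # ok
```

Both "no_headroom" and "fan_fault" correspond to the cases where the text above says to open a ticket.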

GPU Power Draw

Real-time power consumption of the card in watts. Power draw, temperature, and fan speed together tell you whether the card is doing real work or stuck in a degraded state — a card that is "100% utilized" but pulling far less than its rated TDP is usually waiting on something rather than computing.
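
One way to apply this heuristic is to compare power draw against rated TDP whenever utilization reads as high. The sketch below is an assumption-laden illustration: the function name is hypothetical, and because "far less" is not quantified here, the 90% utilization and 50% of TDP cutoffs are placeholders to tune against your own workloads.

```python
def looks_stalled(utilization_pct, power_w, tdp_w,
                  busy_pct=90.0, low_power_fraction=0.5):
    """True if the card reads as busy but draws far less than its rated TDP.

    busy_pct and low_power_fraction are assumed cutoffs, not platform values.
    """
    if tdp_w <= 0:
        raise ValueError("tdp_w must be positive")
    return utilization_pct >= busy_pct and power_w / tdp_w < low_power_fraction


print(looks_stalled(utilization_pct=100, power_w=110, tdp_w=400))  # True
print(looks_stalled(utilization_pct=100, power_w=360, tdp_w=400))  # False
```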

GPU Reset Required

A hard signal that the device has reported a fatal state requiring a reset. If this goes high, support and engineering are already looking at it from the infrastructure side; the affected workload will not recover until the card is reset.

API Reference

GPU monitoring metrics are not yet available via the API. All GPU panels are console-only.
