
# Is My GPU Healthy?

GPU panels appear only on GPU instances.

## GPU Utilization

Compute occupancy of the GPU, as a percentage: the share of sample windows in which at least one CUDA kernel was running on the device. It tells you the GPU was *busy*, but not necessarily that it was busy with useful work. A job that is bottlenecked on host-side data loading can still launch a short kernel in most windows, so it can show high utilization while throughput is low. Always read this panel together with your application-side throughput numbers.
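To make the definition concrete, here is a minimal sketch of how a busy-window percentage behaves and why it should be paired with throughput. The function names, the boolean sample format, and the default thresholds are illustrative assumptions; the console does not expose these values programmatically.

```python
from typing import Sequence


def utilization_percent(kernel_active: Sequence[bool]) -> float:
    """Share of sample windows in which at least one kernel was running."""
    if not kernel_active:
        return 0.0
    return 100.0 * sum(kernel_active) / len(kernel_active)


def busy_but_slow(util_pct: float, throughput: float, baseline: float,
                  util_floor: float = 90.0, slow_ratio: float = 0.5) -> bool:
    """True when the GPU looks busy but the application runs well below its
    usual throughput. Thresholds are hypothetical; tune them per workload."""
    if baseline <= 0:
        return False
    return util_pct >= util_floor and (throughput / baseline) < slow_ratio


# A window counts as busy even if only one short kernel ran in it.
windows = [True] * 19 + [False]
util = utilization_percent(windows)

# Hypothetical numbers: 400 samples/s now versus a 1000 samples/s baseline.
print(util, busy_but_slow(util, throughput=400.0, baseline=1000.0))
```

The throughput and baseline figures come from your own application metrics, in whatever unit you already track (samples per second, tokens per second, and so on).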

## GPU Memory Used / GPU Memory Total

VRAM accounting. GPU Memory Used is the amount currently allocated by your processes on the device; GPU Memory Total is the physical capacity of the card.

VRAM exhaustion is the most common cause of CUDA out-of-memory crashes. If a job dies unexpectedly, check this panel before assuming a host issue: GPU Memory Used running close to GPU Memory Total is the warning sign.
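As a rough illustration of "running close to the total", the sketch below turns the two panel values into a used fraction and a remaining headroom. The function name, the MiB unit, and the 90% warning level are assumptions chosen for the example, not console settings.

```python
from typing import Tuple


def vram_headroom(used_mib: float, total_mib: float,
                  warn_fraction: float = 0.9) -> Tuple[float, float, bool]:
    """Return (fraction_used, free_mib, near_limit) for one GPU.

    warn_fraction is a hypothetical alert level, not a vendor constant.
    """
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    fraction = used_mib / total_mib
    return fraction, total_mib - used_mib, fraction >= warn_fraction


# Hypothetical readings: 78,000 MiB used on an 81,920 MiB card.
fraction, free_mib, near_limit = vram_headroom(78_000, 81_920)
print(f"{fraction:.1%} used, {free_mib:.0f} MiB free, near limit: {near_limit}")
```

A job whose peak allocation exceeds the free figure is a likely candidate for an out-of-memory failure, even if its steady-state usage fits.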

## GPU Temperature

Die temperature of the GPU, in degrees Celsius. Modern data-center GPUs report a single junction temperature (NVIDIA) or edge temperature (AMD), which the device firmware uses to drive its own thermal-management decisions.

What to look for:

* **Sustained temperatures near the throttle threshold** (typically the mid-80s to low-90s °C, vendor-dependent) — the firmware will down-clock the card to keep it safe, which shows up in your application as a sudden, unexplained performance drop while utilization stays high. If you see this with workloads that used to run cooler on the same hardware, the cooling path has degraded; contact support.
* **Sudden spikes** during a previously stable workload usually indicate a fan event (see GPU Fan Speed below) or a hot neighbor on a shared chassis.
* **Temperatures consistently far below the throttle threshold** — healthy. Nothing to do.

You cannot directly fix GPU temperature from inside the guest; if the card is running hot, the action is either to reduce the workload's duty cycle, or to escalate to support so the chassis can be inspected.
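The difference between a sustained high temperature and a brief spike can be sketched as a run-length check over a series of readings. The 85 °C threshold and the five-sample minimum below are placeholder assumptions; the real throttle point is vendor-dependent, and the readings are assumed to be values you noted from the panel at a fixed interval.

```python
from typing import Sequence


def longest_hot_run(temps_c: Sequence[float], threshold_c: float) -> int:
    """Length of the longest run of consecutive samples at or above threshold_c."""
    longest = current = 0
    for temp in temps_c:
        current = current + 1 if temp >= threshold_c else 0
        longest = max(longest, current)
    return longest


def is_sustained_hot(temps_c: Sequence[float], threshold_c: float = 85.0,
                     min_samples: int = 5) -> bool:
    """True when the card stayed at or above the threshold for min_samples in a row."""
    return longest_hot_run(temps_c, threshold_c) >= min_samples


spike = [70, 71, 88, 72, 70, 71]          # one hot sample, then recovery
sustained = [80, 86, 87, 88, 87, 86, 86]  # stays hot once it gets there

print(is_sustained_hot(spike), is_sustained_hot(sustained))
```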

## GPU Fan Speed

Fan speed reported as a fraction between 0 and 1 (`0.7` means the fan is running at 70% of its maximum). The card's firmware drives the fan automatically based on temperature, so this panel is most useful when read together with **GPU Temperature**:

* **Fan speed climbs in step with temperature** — the cooling loop is working as intended. Nothing to do.
* **Fan pinned at maximum while temperature continues to climb** — the cooling system has run out of headroom. The card will throttle next, and possibly hit GPU Reset Required after that. Open a ticket immediately.
* **Fan speed low while temperature is high** — the firmware is not driving the fan, or the fan controller has failed. This is a hardware fault; open a ticket.
* **Fan speed at 0** on an idle card — normal for cards with passive idle behavior.
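The four cases above amount to a small decision table over fan speed and temperature. The sketch below encodes it; the function name, the returned labels, and every threshold (85 °C for "hot", 0.98 for "pinned", 0.3 for "low") are assumptions made for illustration only.

```python
def classify_cooling(fan: float, temp_c: float, temp_rising: bool,
                     hot_c: float = 85.0, fan_max: float = 0.98,
                     fan_low: float = 0.3) -> str:
    """Map one fan/temperature reading to a coarse cooling status.

    fan is the fraction shown on the panel (0 to 1). All thresholds are
    hypothetical defaults, not vendor limits.
    """
    if not 0.0 <= fan <= 1.0:
        raise ValueError("fan must be a fraction between 0 and 1")
    if fan >= fan_max and temp_rising:
        return "out_of_headroom"  # fan pinned while temperature still climbs
    if temp_c >= hot_c and fan <= fan_low:
        return "fan_fault"        # hot card, fan not responding
    return "ok"                   # fan tracking temperature, or idle at 0


print(classify_cooling(1.0, 88.0, temp_rising=True))
print(classify_cooling(0.1, 90.0, temp_rising=False))
print(classify_cooling(0.0, 35.0, temp_rising=False))
```

The first two results correspond to the cases that call for a ticket; an idle card with the fan at 0 falls through to "ok" because its temperature is low.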

## GPU Power Draw

Real-time power consumption of the card in watts. Power draw, temperature, and fan speed together tell you whether the card is doing real work or stuck in a degraded state — a card that is "100% utilized" but pulling far less than its rated TDP is usually waiting on something rather than computing.
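One way to express the "busy but not computing" check is to compare power draw against the card's rated TDP whenever utilization is high. This is a sketch under assumed thresholds (90% utilization, 50% of TDP); the function name and the sample wattages are hypothetical, and the TDP figure must come from your card's specification.

```python
def likely_stalled(util_pct: float, power_w: float, tdp_w: float,
                   util_floor: float = 90.0, power_fraction: float = 0.5) -> bool:
    """True when utilization is high but power draw is far below rated TDP.

    Both thresholds are illustrative; what counts as "far below" depends
    on the card and the workload.
    """
    if tdp_w <= 0:
        raise ValueError("tdp_w must be positive")
    return util_pct >= util_floor and (power_w / tdp_w) < power_fraction


# Hypothetical readings on a card with a 400 W rated TDP.
print(likely_stalled(100.0, 120.0, 400.0))  # busy on paper, drawing 30% of TDP
print(likely_stalled(100.0, 380.0, 400.0))  # busy and drawing near TDP
```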

## GPU Reset Required

A hard signal that the device has reported a fatal state requiring a reset. If this goes high, support and engineering are already looking at it from the infrastructure side; the affected workload will not recover until the card is reset.

## API Reference

GPU monitoring metrics are not yet available via the API. All GPU panels are console-only.