# Instance Monitoring

## Why this guide exists

ZEC's product positioning rests on three pillars:

1. **Global coverage.** ZEC reaches further than the hyperscalers, but never so far out that we cannot deliver production-grade network and compute. The locations we offer are the locations we can stand behind.
2. **Built for production.** Every line of code, every metric, every operational choice exists to give workloads running on ZEC a stable, predictable environment — and to give you the evidence you need to trust it. ZEC is not a VPS; it is infrastructure for production applications.
3. **No middle layer.** The product surface is intentionally small and concept-light. When something cannot be exposed as a knob, support routes the issue straight to engineering. We do not put a translation layer between you and the people who can actually fix things.

This guide is the practical expression of the second pillar: built for production.

## Why a metrics catalog is part of "built for production"

Every panel on the ZEC instance monitoring dashboard exists because it was needed during a real production incident — by us, by a customer, or both. None of these panels are decorative.

But a list of panels is not enough. In an incident, the people who matter — application owners, on-call engineers, support — do not have time to learn what 30 metrics mean. They need to know:

* **Which question am I trying to answer right now?**
* **Which panel answers it?**
* **What do I do when the panel moves?**

Without that framing, even the best metric is just a graph that nobody reads. With it, the same metric becomes the difference between a five-minute conversation and a two-day investigation.

To access the monitoring dashboard, navigate to the [ZEC Instances](https://console.zenlayer.com/zec/virtualMachine) page, select an instance, and open the **Monitoring** tab.

This guide is organized around the questions that come up over and over again in production troubleshooting. Each question gets one short page that lists the relevant panels, explains what they mean, and tells you what action to take when they move. If you know which question you are asking, you can go straight to the answer.

## Summary

| #  | Question                                                                                                                        | Likely cause                                     | What to do                                                                       | Key Console panels                                                                                     |
| -- | ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------ | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| Q1 | [Is my instance being affected by host over-provisioning?](/welcome/elastic-compute/instance-monitoring/01-overprovisioning.md) | Infrastructure                                   | Contact support                                                                  | Hypervisor CPU Queue Time, CPU / Memory / I/O Pressure                                                 |
| Q2 | [Is my own workload the problem?](/welcome/elastic-compute/instance-monitoring/02-workload-self.md)                             | Application                                      | Profile the workload, tune the working set                                       | Memory Utilization (Real Utilization), Swap In / Out, KSWAPD Steal, KSWAPD LHWM, System Load Average   |
| Q3 | [Did the network layer drop my packets?](/welcome/elastic-compute/instance-monitoring/03-network-drops.md)                      | Usually rate limit, occasionally infrastructure  | Raise IP bandwidth or smooth bursts; if vNIC drops are non-zero, contact support | IP Bandwidth, IP Packet Transmission, vNIC Bandwidth, vNIC Packet Transmission                         |
| Q4 | [How is my CPU actually being spent?](/welcome/elastic-compute/instance-monitoring/04-cpu-spend.md)                             | Application                                      | Profile the workload                                                             | CPU Utilization / User / System / IOWait / SoftIRQ / Idle / Other, Hypervisor CPU Time / User / System |
| Q5 | [How is my disk behaving?](/welcome/elastic-compute/instance-monitoring/05-disk.md)                                             | Usually saturation, occasionally backend         | Reduce request rate or contact support                                           | Throughput, Operations (IOPS), Disk Utilization                                                        |
| Q6 | [Is the instance alive, and how is its OS doing?](/welcome/elastic-compute/instance-monitoring/06-liveness.md)                  | Application                                      | Reboot triage; hunt for connection leaks                                         | Uptime — Running for, TCP Connections                                                                  |
| Q7 | [Is my GPU healthy?](/welcome/elastic-compute/instance-monitoring/07-gpu.md)                                                    | VRAM is yours; thermal / reset is infrastructure | Tune for OOM; thermal and reset events are auto-escalated                        | GPU panels                                                                                             |

The "Likely cause" column tells you, at a glance, where to start looking when a panel moves. Most panels point cleanly at either your application or the underlying infrastructure — that distinction is the most important thing this dashboard gives you.
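
As an illustration of how a row in the table turns into a decision, the Q3 rule ("if vNIC drops are non-zero, contact support") can be sketched in a few lines of Python. This is only a sketch: the `DropCounts` type, its field names, and the idea that you have already read drop counts off the IP and vNIC packet panels are assumptions made for the example, not part of any ZEC API.

```python
from dataclasses import dataclass


@dataclass
class DropCounts:
    """Dropped-packet counts for one observation window (hypothetical shape)."""
    ip_drops: int    # assumed: drops read from the IP Packet Transmission panel
    vnic_drops: int  # assumed: drops read from the vNIC Packet Transmission panel


def triage_network_drops(counts: DropCounts) -> str:
    """Return the Q3 action from the summary table for the given counts."""
    # Non-zero vNIC drops point at infrastructure, so they take priority.
    if counts.vnic_drops > 0:
        return "contact support"
    # Drops at the IP level alone usually mean the rate limit was hit.
    if counts.ip_drops > 0:
        return "raise IP bandwidth or smooth bursts"
    return "no action"


if __name__ == "__main__":
    print(triage_network_drops(DropCounts(ip_drops=12, vnic_drops=0)))
```

The vNIC check comes first because the table treats non-zero vNIC drops as the case that needs support, regardless of what the IP panels show.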

## What you should expect from this dashboard

* **No vanity numbers.** If a metric cannot drive a decision — open a ticket, change a config, scale a workload — it is not on the dashboard.
* **Cross-validated where it matters.** CPU steal is reported from both the hypervisor side and the guest side. Memory is reported as both the cache-inclusive and the real number. You never have to take a single number's word for it.
* **Measured where the truth lives.** Network drops and CPU contention are measured on the hypervisor side, because by the time the guest could observe them the evidence is already gone. Application-level signals (load average, swap activity, TCP sockets) are measured inside the guest, where the workload actually runs.
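
To make the "cache-inclusive versus real" distinction concrete, the sketch below computes both numbers from `/proc/meminfo`-style text inside a Linux guest. The two formulas are assumptions chosen for illustration (free-based versus `MemAvailable`-based); the dashboard's exact definitions are not stated on this page and may differ.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_utilization(fields: dict[str, int]) -> tuple[float, float]:
    """Return (cache_inclusive_pct, real_pct).

    Assumed definitions, not confirmed by the ZEC documentation:
      cache-inclusive = (MemTotal - MemFree) / MemTotal
      real            = (MemTotal - MemAvailable) / MemTotal
    """
    total = fields["MemTotal"]
    cache_inclusive = 100.0 * (total - fields["MemFree"]) / total
    real = 100.0 * (total - fields["MemAvailable"]) / total
    return cache_inclusive, real


# Made-up sample values; on a real guest, read /proc/meminfo instead.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          800000 kB
MemAvailable:    6000000 kB
Cached:          5000000 kB
"""

if __name__ == "__main__":
    inclusive, real = memory_utilization(parse_meminfo(SAMPLE))
    print(f"cache-inclusive: {inclusive:.1f}%  real: {real:.1f}%")
```

With the sample values, the cache-inclusive figure is far higher than the real one because most of the "used" memory is reclaimable page cache. A gap like that is why the page reports both numbers.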

When a panel on this dashboard moves, the next action should be obvious. That is the entire point.


