Instance Monitoring

Why this guide exists

ZEC's product positioning rests on three pillars:

  1. Global coverage. ZEC reaches further than the hyperscalers, but never so far out that we cannot deliver production-grade network and compute. The locations we offer are the locations we can stand behind.

  2. Built for production. Every line of code, every metric, every operational choice exists to give workloads running on ZEC a stable, predictable environment — and to give you the evidence you need to trust it. ZEC is not a VPS; it is infrastructure for production applications.

  3. No middle layer. The product surface is intentionally small and concept-light. When something cannot be exposed as a knob, support routes the issue straight to engineering. We do not put a translation layer between you and the people who can actually fix things.

This guide is the practical expression of pillar #2.

Why a metrics catalog is part of "built for production"

Every panel on the ZEC instance monitoring dashboard exists because it was needed during a real production incident — by us, by a customer, or both. None of these panels are decorative.

But a list of panels is not enough. In an incident, the people who matter — application owners, on-call engineers, support — do not have time to learn what 30 metrics mean. They need to know:

  • Which question am I trying to answer right now?

  • Which panel answers it?

  • What do I do when the panel moves?

Without that framing, even the best metric is just a graph that nobody reads. With it, the same metric becomes the difference between a five-minute conversation and a two-day investigation.

To access the monitoring dashboard, navigate to the ZEC Instances page, select an instance, and open the Monitoring tab.

This guide is organized around the questions that recur in production troubleshooting. Each question gets one short page that lists the relevant panels, explains what they mean, and tells you what action to take when they move. If you know which question you are asking, you can go straight to the answer.

Summary

| #  | Question | Likely cause | What to do | Key Console panels |
|----|----------|--------------|------------|--------------------|
| Q1 |          | Infrastructure | Contact support | Hypervisor CPU Queue Time, CPU / Memory / I/O Pressure |
| Q2 |          | Application | Profile the workload, tune the working set | Memory Utilization (Real Utilization), Swap In / Out, KSWAPD Steal, KSWAPD LHWM, System Load Average |
| Q3 |          | Usually rate limit, occasionally infrastructure | Raise IP bandwidth or smooth bursts; if vNIC drops are non-zero, contact support | IP Bandwidth, IP Packet Transmission, vNIC Bandwidth, vNIC Packet Transmission |
| Q4 |          | Application | Profile the workload | CPU Utilization / User / System / IOWait / SoftIRQ / Idle / Other, Hypervisor CPU Time / User / System |
| Q5 |          | Usually saturation, occasionally backend | Reduce request rate or contact support | Throughput, Operations (IOPS), Disk Utilization |
| Q6 |          | Application | Reboot triage; hunt for connection leaks | Uptime — Running for, TCP Connections |
| Q7 |          | VRAM is yours; thermal / reset is infrastructure | Tune for OOM; thermal and reset events are auto-escalated | GPU panels |

The "Likely cause" column tells you, at a glance, where to start looking when a panel moves. Most panels point cleanly at either your application or the underlying infrastructure — that distinction is the most important thing this dashboard gives you.
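The summary table can also be read as a simple lookup from question ID to a starting point. The Python sketch below restates the table rows as data; the `TRIAGE` dictionary and the `next_step` helper are hypothetical names used only for illustration and are not part of any ZEC API or tooling.

```python
# Hypothetical sketch: the summary table expressed as a lookup.
# The entries restate the table rows; nothing here calls a ZEC API.
TRIAGE = {
    "Q1": ("Infrastructure", "Contact support"),
    "Q2": ("Application", "Profile the workload, tune the working set"),
    "Q3": ("Usually rate limit, occasionally infrastructure",
           "Raise IP bandwidth or smooth bursts"),
    "Q4": ("Application", "Profile the workload"),
    "Q5": ("Usually saturation, occasionally backend",
           "Reduce request rate or contact support"),
    "Q6": ("Application", "Reboot triage; hunt for connection leaks"),
    "Q7": ("VRAM is yours; thermal / reset is infrastructure",
           "Tune for OOM; thermal and reset events are auto-escalated"),
}


def next_step(question, vnic_drops=0):
    """Return the table's recommended action for a question ID."""
    _cause, action = TRIAGE[question]
    # Q3 carries one conditional in the table: non-zero vNIC drops
    # mean the issue goes to support instead of being tuned locally.
    if question == "Q3" and vnic_drops > 0:
        return "Contact support"
    return action
```

For example, `next_step("Q3")` returns the rate-limit advice, while `next_step("Q3", vnic_drops=5)` returns "Contact support", which mirrors the only branching rule the table contains.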

What you should expect from this dashboard

  • No vanity numbers. If a metric cannot drive a decision — open a ticket, change a config, scale a workload — it is not on the dashboard.

  • Cross-validated where it matters. CPU steal is reported from both the hypervisor side and the guest side. Memory is reported as both the cache-inclusive and the real number. You never have to take a single number's word for it.

  • Measured where the truth lives. Network drops and CPU contention are measured on the hypervisor side, because by the time the guest could observe them the evidence is already gone. Application-level signals (load average, swap activity, TCP sockets) are measured inside the guest, where the workload actually runs.
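To make the cross-validation idea concrete, the Python sketch below computes guest-side counterparts of two of these signals from standard Linux `/proc` text. It rests on assumptions: this guide does not state the dashboard's exact formulas, so "real" utilization is taken here to be based on `MemAvailable` and "cache-inclusive" on `MemFree`, and the function names and sample values are illustrative only.

```python
# Illustrative sketch, not ZEC's implementation. Assumes the common Linux
# conventions: cache-inclusive = MemTotal - MemFree, real = MemTotal - MemAvailable.

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kiB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def memory_utilization(info):
    """Return (cache_inclusive_percent, real_percent)."""
    total = info["MemTotal"]
    cache_inclusive = 100.0 * (total - info["MemFree"]) / total
    real = 100.0 * (total - info["MemAvailable"]) / total
    return cache_inclusive, real


def steal_percent(before, after):
    """Guest-side CPU steal between two aggregate 'cpu' lines of /proc/stat."""
    a = [int(x) for x in before.split()[1:]]
    b = [int(x) for x in after.split()[1:]]
    delta = [y - x for x, y in zip(a, b)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum only 8.
    total = sum(delta[:8])
    return 100.0 * delta[7] / total if total else 0.0


# Made-up sample values, chosen only to keep the arithmetic easy to follow.
SAMPLE_MEMINFO = (
    "MemTotal:       8000000 kB\n"
    "MemFree:        1000000 kB\n"
    "MemAvailable:   6000000 kB\n"
)
SAMPLE_STAT_BEFORE = "cpu 100 0 50 800 10 0 5 35 0 0"
SAMPLE_STAT_AFTER = "cpu 200 0 100 1550 10 0 5 135 0 0"
```

With the sample values, the cache-inclusive figure is 87.5% while the real figure is 25.0%, which is the kind of gap that makes the second number worth reporting. On a live guest, the same functions could be fed the contents of `/proc/meminfo` and two reads of the first line of `/proc/stat` taken a few seconds apart, then compared against the hypervisor-side panels.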

When a panel on this dashboard moves, the next action should be obvious. That is the entire point.
