Instance Monitoring
Why this guide exists
ZEC's product positioning rests on three pillars:
1. Global coverage. ZEC reaches further than the hyperscalers, but never so far out that we cannot deliver production-grade network and compute. The locations we offer are the locations we can stand behind.
2. Built for production. Every line of code, every metric, every operational choice exists to give workloads running on ZEC a stable, predictable environment, and to give you the evidence you need to trust it. ZEC is not a VPS; it is infrastructure for production applications.
3. No middle layer. The product surface is intentionally small and concept-light. When something cannot be exposed as a knob, support routes the issue straight to engineering. We do not put a translation layer between you and the people who can actually fix things.
This guide is the practical expression of pillar #2.
Why a metrics catalog is part of "built for production"
Every panel on the ZEC instance monitoring dashboard exists because it was needed during a real production incident — by us, by a customer, or both. None of these panels are decorative.
But a list of panels is not enough. In an incident, the people who matter — application owners, on-call engineers, support — do not have time to learn what 30 metrics mean. They need to know:
Which question am I trying to answer right now?
Which panel answers it?
What do I do when the panel moves?
Without that framing, even the best metric is just a graph that nobody reads. With it, the same metric becomes the difference between a five-minute conversation and a two-day investigation.
To access the monitoring dashboard, navigate to the ZEC Instances page, select an instance, and open the Monitoring tab.
This guide is organized around the questions that recur in production troubleshooting. Each question gets one short page that lists the relevant panels, explains what they mean, and tells you what action to take when they move. If you know which question you are asking, you can go straight to the answer.
Summary
| Question | Likely cause | Action | Panels |
| --- | --- | --- | --- |
| Q1 | Infrastructure | Contact support | Hypervisor CPU Queue Time, CPU / Memory / I/O Pressure |
| Q2 | Application | Profile the workload, tune the working set | Memory Utilization (Real Utilization), Swap In / Out, KSWAPD Steal, KSWAPD LHWM, System Load Average |
| Q3 | Usually rate limit, occasionally infrastructure | Raise IP bandwidth or smooth bursts; if vNIC drops are non-zero, contact support | IP Bandwidth, IP Packet Transmission, vNIC Bandwidth, vNIC Packet Transmission |
| Q4 | Application | Profile the workload | CPU Utilization / User / System / IOWait / SoftIRQ / Idle / Other, Hypervisor CPU Time / User / System |
| Q5 | Usually saturation, occasionally backend | Reduce request rate or contact support | Throughput, Operations (IOPS), Disk Utilization |
| Q6 | Application | Reboot triage; hunt for connection leaks | Uptime — Running for, TCP Connections |
| Q7 | VRAM is yours; thermal / reset is infrastructure | Tune for OOM; thermal and reset events are auto-escalated | GPU panels |
The "Likely cause" column tells you, at a glance, where to start looking when a panel moves. Most panels point cleanly at either your application or the underlying infrastructure — that distinction is the most important thing this dashboard gives you.
What you should expect from this dashboard
No vanity numbers. If a metric cannot drive a decision — open a ticket, change a config, scale a workload — it is not on the dashboard.
Cross-validated where it matters. CPU steal is reported from both the hypervisor side and the guest side. Memory is reported as both the cache-inclusive and the real number. You never have to take a single number's word for it.
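To make the two cross-validated pairs concrete, here is a minimal guest-side sketch in Python. It is not how ZEC computes its panels, which this guide does not specify. It assumes a Linux guest, that the "cache-inclusive" number corresponds to MemTotal minus MemFree, and that the "real" number corresponds to MemTotal minus MemAvailable. The function names and sample values are hypothetical.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def memory_utilization(fields: dict[str, int]) -> tuple[float, float]:
    """Return (cache-inclusive %, real %) under the assumptions stated above."""
    total = fields["MemTotal"]
    # Assumption: cache-inclusive counts page cache and buffers as used.
    cache_inclusive = 100.0 * (total - fields["MemFree"]) / total
    # Assumption: real usage excludes memory the kernel can reclaim.
    real = 100.0 * (total - fields["MemAvailable"]) / total
    return cache_inclusive, real


def steal_percent(before: str, after: str) -> float:
    """Guest-side CPU steal between two aggregate 'cpu' lines of /proc/stat."""
    a = [int(x) for x in before.split()[1:]]
    b = [int(x) for x in after.split()[1:]]
    delta = [y - x for x, y in zip(a, b)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum only
    # the first eight fields to avoid double counting.
    total = sum(delta[:8])
    return 100.0 * delta[7] / total if total else 0.0


# Hypothetical sample data, not taken from a real instance.
SAMPLE_MEMINFO = """MemTotal:       16000000 kB
MemFree:         1000000 kB
MemAvailable:   12000000 kB
"""
SAMPLE_STAT_BEFORE = "cpu 100 0 50 800 10 0 5 35 0 0"
SAMPLE_STAT_AFTER = "cpu 200 0 100 1500 20 0 10 170 0 0"

if __name__ == "__main__":
    print(memory_utilization(parse_meminfo(SAMPLE_MEMINFO)))   # (93.75, 25.0)
    print(steal_percent(SAMPLE_STAT_BEFORE, SAMPLE_STAT_AFTER))  # 13.5
```

In the sample, the cache-inclusive figure looks alarming at 93.75% while the real figure is 25%, which is the kind of gap the paired panels are meant to expose. On a live guest the inputs would come from reading `/proc/meminfo` and the first line of `/proc/stat` twice, a few seconds apart.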
Measured where the truth lives. Network drops and CPU contention are measured on the hypervisor side, because by the time the guest could observe them the evidence is already gone. Application-level signals (load average, swap activity, TCP sockets) are measured inside the guest, where the workload actually runs.
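The guest-side signals named above can be read on a Linux guest from standard procfs files. The sketch below is illustrative only: it assumes `/proc/loadavg`, `/proc/vmstat`, and `/proc/net/sockstat` as data sources, which may differ from what the ZEC agent actually collects, and every function name is hypothetical. The parsers take text as input so they can be tried on sample data without a live system.

```python
def load_averages(loadavg_text: str) -> tuple[float, float, float]:
    """Return the 1, 5, and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    return one, five, fifteen


def swap_counters(vmstat_text: str) -> dict[str, int]:
    """Extract the cumulative pswpin / pswpout page counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before: dict[str, int], after: dict[str, int], seconds: float) -> dict[str, float]:
    """Pages swapped in and out per second between two counter snapshots."""
    return {key: (after[key] - before[key]) / seconds for key in ("pswpin", "pswpout")}


def tcp_in_use(sockstat_text: str) -> int:
    """Return the TCP 'inuse' socket count from /proc/net/sockstat text."""
    for line in sockstat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "TCP:":
            # After the "TCP:" label, tokens alternate between name and value.
            pairs = dict(zip(parts[1::2], parts[2::2]))
            return int(pairs["inuse"])
    raise ValueError("no TCP line found")


# Hypothetical sample data, not taken from a real instance.
SAMPLE_LOADAVG = "0.52 0.48 0.45 2/345 12345"
SAMPLE_VMSTAT_T0 = "nr_free_pages 5000\npswpin 100\npswpout 200"
SAMPLE_VMSTAT_T1 = "nr_free_pages 4000\npswpin 160\npswpout 500"
SAMPLE_SOCKSTAT = "sockets: used 150\nTCP: inuse 27 orphan 0 tw 3 alloc 30 mem 2\nUDP: inuse 4 mem 1"

if __name__ == "__main__":
    print(load_averages(SAMPLE_LOADAVG))
    rates = swap_rates(swap_counters(SAMPLE_VMSTAT_T0), swap_counters(SAMPLE_VMSTAT_T1), 60.0)
    print(rates)
    print(tcp_in_use(SAMPLE_SOCKSTAT))
```

The swap counters are cumulative since boot, so a rate only makes sense as a difference between two snapshots. A sustained non-zero rate is the application-side signal the text describes.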
When a panel on this dashboard moves, the next action should be obvious. That is the entire point.