Health Check
The health checker continuously probes every backend in a listener's pool and updates each backend's health status. Unhealthy backends are pulled out of scheduling; recovered backends come back in. This is what turns a static list of VMs into a self-healing pool.
Health checks run independently per listener. A VM that appears in two listeners is probed twice — once for each listener's settings — and can be healthy in one while unhealthy in the other, if their probes target different ports.
Probe Types
Four probe types are available, configured per listener:
TCP_CHECK
Opens a TCP connection to the backend's check port. Success = completed handshake. The probe closes the connection immediately after.
HTTP_GET
Issues an HTTP GET to a configured path and compares the response's status code to an expected value (e.g., 200). Success = matching status. Works for both HTTP and HTTPS — the http_protocol is configurable.
MISC_CHECK
Runs a platform-provided custom probe. Used for services where TCP and HTTP probes aren't expressive enough. Contact support to configure.
PING_CHECK
ICMP echo to the backend. Rarely useful as a service health check: a VM can answer pings while its service is broken. Prefer TCP_CHECK or HTTP_GET where possible.
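As a rough illustration of what the TCP_CHECK and HTTP_GET probes amount to, the sketch below implements both with Python's standard library. The function names `tcp_check` and `http_get_check` and their arguments are hypothetical, chosen for this example; the platform's own probe implementation is not documented here and may differ. The `timeout` argument stands in for ConnectionTimeout.

```python
import http.client
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Succeed if the TCP handshake completes; close the connection right away."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def http_get_check(host: str, port: int, path: str = "/healthz",
                   expected_status: int = 200, timeout: float = 3.0,
                   use_https: bool = False) -> bool:
    """Succeed if GET <path> returns exactly the expected status code."""
    conn_cls = (http.client.HTTPSConnection if use_https
                else http.client.HTTPConnection)
    conn = conn_cls(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == expected_status
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

Running both against the same backend shows the difference in strength: `tcp_check` passes as soon as the port accepts a connection, while `http_get_check` also requires the service to answer the request with the expected status.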
Which probe type to use
TCP_CHECK is the default choice for arbitrary TCP services. It verifies the backend accepts connections. It does not verify the service is actually serving requests: a stuck backend may still accept connections without processing them.
HTTP_GET is the right choice for HTTP or HTTPS services. Hit a dedicated /healthz endpoint that returns 200 only when the service can serve real traffic. This catches stuck-service cases that TCP_CHECK misses.
MISC_CHECK is an escape hatch for custom logic: a script can do anything, from protocol-specific handshakes to querying a downstream dependency.
PING_CHECK should only be used when no other option works. It checks host liveness, not service liveness.
Probe Parameters
Each listener's health check is configured with these parameters:
EnableHealthCheck (default: on): Whether the health checker runs at all for this listener.
CheckPort (default: 0): Port to probe. 0 means "same as backend port." Use a non-zero value to probe a health endpoint on a different port than the service itself.
DelayLoop (default: 10 s): Interval between probes.
ConnectionTimeout (default: 3 s): How long to wait for the probe to succeed before counting it as a failure.
RetryCount (default: 1): Consecutive failures needed to mark a healthy backend unhealthy.
DelayRetry (default: 3 s): Interval between retry attempts after an initial failure.
HttpCheckUrl (default: none): For HTTP_GET, the path to request (e.g., /healthz).
HttpStatusCode (default: 200): For HTTP_GET, the expected response status.
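To make the defaults concrete, here is a minimal sketch that models the parameters as a Python dataclass. The class name `HealthCheckConfig`, the snake_case field names, and the helper `effective_check_port` are assumptions made for this example and are not part of any platform API; only the parameter names in the comments and their default values come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthCheckConfig:
    """Listener health-check settings, using the documented defaults."""
    enable_health_check: bool = True      # EnableHealthCheck
    check_port: int = 0                   # CheckPort; 0 = same as backend port
    delay_loop_s: float = 10.0            # DelayLoop
    connection_timeout_s: float = 3.0     # ConnectionTimeout
    retry_count: int = 1                  # RetryCount
    delay_retry_s: float = 3.0            # DelayRetry
    http_check_url: Optional[str] = None  # HttpCheckUrl (HTTP_GET only)
    http_status_code: int = 200           # HttpStatusCode (HTTP_GET only)

    def effective_check_port(self, backend_port: int) -> int:
        """Resolve the port actually probed for a given backend."""
        return self.check_port or backend_port


default_cfg = HealthCheckConfig()
side_port_cfg = HealthCheckConfig(check_port=9000, http_check_url="/healthz")
print(default_cfg.effective_check_port(8080))    # probes the service port
print(side_port_cfg.effective_check_port(8080))  # probes the dedicated port
```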
Picking parameters
DelayLoop: more frequent probes detect failures faster but generate more traffic. For services with a long warm-up, a shorter loop gets newly added backends serving traffic sooner, since they become eligible only after a successful probe. 10 s is reasonable for most services.
ConnectionTimeout: set comfortably above the backend's normal response time. A too-tight timeout produces false positives during load spikes.
RetryCount: a higher retry count makes the checker more tolerant of transient failures but delays genuine-failure detection. Two or three is a common middle ground for noisy environments; one is fine for stable backends.
HttpCheckUrl: point at a dedicated health endpoint, not at a user-facing route. The endpoint should check the parts of the service that matter (database reachable, dependencies up) and return 200 only when real traffic can be served.
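One way a backend might expose such an endpoint is sketched below with Python's standard library. Everything named here is illustrative: the `health_status` helper, the `CHECKS` table with its placeholder lambdas, the 503 failure status, and port 8081 are assumptions for this example, not requirements of the load balancer. The only contract taken from the text is that the endpoint returns 200 when, and only when, real traffic can be served.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (200, "ok") only if every dependency check passes."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failed dependency
        if not ok:
            failed.append(name)
    if failed:
        return 503, "failing: " + ", ".join(sorted(failed))
    return 200, "ok"


# Hypothetical dependency checks; replace the lambdas with real probes.
CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = health_status(CHECKS)
        payload = body.encode()
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8081) -> None:
    """Serve /healthz on a dedicated port (blocks until interrupted)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

With this running on port 8081, the listener would use CheckPort = 8081, HttpCheckUrl = /healthz, and HttpStatusCode = 200. Keeping the checks cheap matters, because each one runs on every probe and must finish within ConnectionTimeout.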
State Transitions
A backend moves through four effective states:
Healthy (scheduled: yes): Current probe succeeds, and prior probes have succeeded.
Probing / retrying (scheduled: yes, still considered healthy): A probe fails, but RetryCount retries haven't run out yet.
Unhealthy (scheduled: no): Failed probes and retries exhausted; the backend is removed from scheduling.
Recovering (scheduled: no, still excluded): Unhealthy backend where the next probe is pending. First success re-admits.
Going unhealthy. A healthy backend enters probing on the first failure. It stays there while retries run (separated by DelayRetry). Once RetryCount consecutive failures accumulate, the backend is marked unhealthy and pulled from scheduling. In-flight connections continue to run but no new flows are scheduled.
Recovering. An unhealthy backend keeps being probed at the normal DelayLoop interval. One successful probe re-admits it — there is no "successes required" threshold on the recovery side. If you want stricter recovery, raise DelayLoop so that a single success represents a longer stability window.
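The transitions described above can be sketched as a small state machine. This is an interpretation rather than the platform's implementation: it assumes that RetryCount counts total consecutive failures (including the first one), and it folds the "recovering" state into "unhealthy", since the two differ only in whether a probe is pending. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    retry_count: int = 1           # RetryCount
    consecutive_failures: int = 0
    healthy: bool = True           # False once removed from scheduling

    @property
    def state(self) -> str:
        if not self.healthy:
            return "unhealthy"
        return "probing" if self.consecutive_failures else "healthy"

    @property
    def schedulable(self) -> bool:
        # A backend that is still retrying keeps receiving new flows.
        return self.healthy

    def record_probe(self, success: bool) -> str:
        """Apply one probe result and return the resulting state."""
        if success:
            # A single success re-admits; there is no recovery threshold.
            self.consecutive_failures = 0
            self.healthy = True
        elif self.healthy:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.retry_count:
                self.healthy = False
        return self.state

    def next_probe_delay(self, delay_loop_s: float = 10.0,
                         delay_retry_s: float = 3.0) -> float:
        """Retries use DelayRetry; everything else uses DelayLoop."""
        return delay_retry_s if self.state == "probing" else delay_loop_s


backend = BackendHealth(retry_count=3)
for result in (False, False, False, True):
    print(result, backend.record_probe(result), backend.schedulable)
```

With RetryCount = 3, the backend stays schedulable through the first two failures, is removed on the third, and returns on the first success.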
All-Failed Fallback
When every backend in a listener fails the health check at the same time, there are two possible explanations: either every backend really is down (unlikely but possible), or the health check itself is broken (the probe target is wrong, an intermediate firewall rule changed, and so on). The listener has a configurable response:
HealthTreatFailure = 0 (default, hard fail): Backends stay marked unhealthy. No new flows are scheduled; the service returns errors to clients.
HealthTreatFailure = 1 (treat as healthy): Every backend is treated as if healthy for scheduling purposes, even though probes are failing. Flows are distributed as normal while you investigate.
Which to pick:
Hard fail is safer for most services — if every backend really is down, sending traffic to them does nothing useful.
Treat as healthy is useful for services where keeping the door open is better than going fully dark. If the probe is the thing that's broken (not the service), this keeps traffic flowing until you fix it.
This setting only kicks in when all backends are unhealthy. If any backend is healthy, traffic goes to the healthy ones, regardless of the fallback value.
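The selection rule can be written down in a few lines. The function name `schedulable_backends` and the dictionary-of-booleans input are assumptions for this sketch; the logic follows the behavior described above, where the fallback value matters only when no backend is healthy.

```python
from typing import Dict, List


def schedulable_backends(health: Dict[str, bool],
                         health_treat_failure: int = 0) -> List[str]:
    """Return the backends eligible for new flows.

    health maps backend name -> current probe verdict.
    """
    healthy = [name for name, ok in health.items() if ok]
    if healthy:
        # At least one healthy backend: the fallback value is ignored.
        return healthy
    if health_treat_failure == 1:
        # All failed, treat as healthy: schedule across every backend.
        return list(health)
    # All failed, hard fail: nothing is eligible.
    return []


print(schedulable_backends({"vm-a": True, "vm-b": False}, 1))   # ['vm-a']
print(schedulable_backends({"vm-a": False, "vm-b": False}, 0))  # []
print(schedulable_backends({"vm-a": False, "vm-b": False}, 1))  # ['vm-a', 'vm-b']
```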
Isolation and Recovery Behavior
Unhealthy backends still hold their existing connections. The LB does not tear down active flows when a backend fails a probe — it stops scheduling new flows. Existing connections run until they close normally or idle out. If the backend is truly dead, those connections will fail at the application layer.
Recovered backends receive new traffic on the next scheduling decision. The scheduler re-evaluates on each new flow; once the backend is healthy, it starts being chosen again.
Weights and persistence interact with health. With session persistence on, a recovered backend is eligible for new sticky-able flows, but existing sticky entries pointing at a different backend don't move. With wlc, recovered backends that come back at weight 3 will start with zero active flows and may briefly attract a disproportionate share until the counts re-level.
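The re-leveling effect is easy to reproduce in a toy model. The sketch below assumes the scheduler in question is weighted least-connection and that it picks the backend with the lowest ratio of active flows to weight; that selection rule, the function names, and the flow counts are all assumptions for illustration, and the simulation ignores flows that close during the burst.

```python
from typing import Dict, Tuple


def pick_wlc(backends: Dict[str, Tuple[int, int]]) -> str:
    """backends maps name -> (weight, active_flows); weight must be > 0."""
    return min(backends,
               key=lambda name: backends[name][1] / backends[name][0])


def simulate(backends: Dict[str, Tuple[int, int]],
             new_flows: int) -> Dict[str, int]:
    """Assign new flows one at a time; return how many each backend got."""
    state = dict(backends)
    received = {name: 0 for name in state}
    for _ in range(new_flows):
        chosen = pick_wlc(state)
        weight, active = state[chosen]
        state[chosen] = (weight, active + 1)
        received[chosen] += 1
    return received


# vm-c has just recovered: same weight as its peers, zero active flows.
pool = {"vm-a": (3, 30), "vm-b": (3, 30), "vm-c": (3, 0)}
print(simulate(pool, 20))  # {'vm-a': 0, 'vm-b': 0, 'vm-c': 20}
```

Under these assumptions the recovered backend takes every new flow until its count catches up with its peers, which is worth keeping in mind if a backend needs a warm-up period after recovery.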
Disabling Health Checks
You can disable health checks on a listener entirely (EnableHealthCheck = 0). When disabled, every backend is always considered eligible for scheduling regardless of reachability.
Disable only if something else is managing backend membership — for example, a control system that adds and removes backends based on its own liveness signal. Otherwise, a dead backend will keep receiving flows until you notice.