Health Check

The health checker continuously probes every backend in a listener's pool and updates each backend's health status. Unhealthy backends are pulled out of scheduling; recovered backends come back in. This is what turns a static list of VMs into a self-healing pool.

Health checks run independently per listener. A VM that appears in two listeners is probed twice — once for each listener's settings — and can be healthy in one while unhealthy in the other, if their probes target different ports.

Figure: Health check state machine

Probe Types

Four probe types are available, configured per listener:

| Type | What it does |
| --- | --- |
| TCP_CHECK | Opens a TCP connection to the backend's check port. Success = completed handshake. The probe closes the connection immediately after. |
| HTTP_GET | Issues an HTTP GET to a configured path and compares the response's status code to an expected value (e.g., 200). Success = matching status. Works for both HTTP and HTTPS; the http_protocol is configurable. |
| MISC_CHECK | Runs a platform-provided custom probe. Used for services where TCP and HTTP probes aren't expressive enough. Contact support to configure. |
| PING_CHECK | ICMP echo to the backend. Rarely useful as a service health check: a VM can answer pings while its service is broken. Prefer TCP_CHECK or HTTP_GET where possible. |

Which probe type to use

  • TCP_CHECK is the default choice for arbitrary TCP services. It verifies the backend accepts connections. It does not verify the service is actually serving requests — a stuck backend may still accept connections without processing them.

  • HTTP_GET is the right choice for HTTP or HTTPS services. Hit a dedicated /healthz endpoint that returns 200 only when the service can serve real traffic. This catches stuck-service cases that TCP_CHECK misses.

  • MISC_CHECK is an escape hatch for custom logic — a script can do anything, from protocol-specific handshakes to querying a downstream dependency.

  • PING_CHECK should only be used when no other option works. It checks host liveness, not service liveness.
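To make the difference between the two common probe types concrete, here is a minimal Python sketch of what a TCP_CHECK-style and an HTTP_GET-style probe do. This illustrates the behavior described above and is not the platform's implementation; the function names, the `use_tls` flag, and the 3-second default timeout (borrowed from ConnectionTimeout) are assumptions made for the example.

```python
import http.client
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """TCP_CHECK-style probe: succeed if the handshake completes in time."""
    try:
        # create_connection returns only once the TCP handshake has completed.
        with socket.create_connection((host, port), timeout=timeout):
            return True  # leaving the with-block closes the connection
    except OSError:
        # Refused, unreachable, or timed out: count as a failed probe.
        return False


def http_get_check(host: str, port: int, path: str = "/healthz",
                   expected_status: int = 200, timeout: float = 3.0,
                   use_tls: bool = False) -> bool:
    """HTTP_GET-style probe: succeed if GET `path` returns `expected_status`."""
    conn_cls = (http.client.HTTPSConnection if use_tls
                else http.client.HTTPConnection)
    conn = conn_cls(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == expected_status
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

A backend whose process is wedged but still listening would pass `tcp_check` and fail `http_get_check`, which is the gap the HTTP probe is meant to close.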

Probe Parameters

Each listener's health check is configured with these parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| EnableHealthCheck | on | Whether the health checker runs at all for this listener. |
| CheckPort | 0 | Port to probe. 0 means "same as backend port." Use a non-zero value to probe a health endpoint on a different port than the service itself. |
| DelayLoop | 10 s | Interval between probes. |
| ConnectionTimeout | 3 s | How long to wait for the probe to succeed before counting it as a failure. |
| RetryCount | 1 | Consecutive failures needed to mark a healthy backend unhealthy. |
| DelayRetry | 3 s | Interval between retry attempts after an initial failure. |
| HttpCheckUrl | | For HTTP_GET: the path to request (e.g., /healthz). |
| HttpStatusCode | 200 | For HTTP_GET: expected response status. |
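As a compact way to read the parameters together, the sketch below models them as a Python dataclass with the documented defaults. The class, its snake_case field names, and the `probe_port` helper are hypothetical and exist only for illustration; they are not part of any platform API. The empty `http_check_url` reflects that no default is listed for HttpCheckUrl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckConfig:
    """Hypothetical container mirroring the documented parameters."""
    enable_health_check: bool = True    # EnableHealthCheck (default: on)
    check_port: int = 0                 # CheckPort; 0 = same as backend port
    delay_loop_s: float = 10.0          # DelayLoop
    connection_timeout_s: float = 3.0   # ConnectionTimeout
    retry_count: int = 1                # RetryCount
    delay_retry_s: float = 3.0          # DelayRetry
    http_check_url: str = ""            # HttpCheckUrl (no documented default)
    http_status_code: int = 200         # HttpStatusCode

    def probe_port(self, backend_port: int) -> int:
        """Resolve the port actually probed for a given backend port."""
        return self.check_port if self.check_port != 0 else backend_port


# With defaults, the probe targets the service port itself.
default_cfg = HealthCheckConfig()
# A non-zero CheckPort redirects the probe to a separate health endpoint.
sidecar_cfg = HealthCheckConfig(check_port=9000, http_check_url="/healthz")
```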

Picking parameters

  • DelayLoop — more frequent probes detect failures faster but generate more traffic. For services with a long warm-up, a shorter loop makes added backends serve traffic sooner (they're only eligible after a successful probe). 10 s is reasonable for most services.

  • ConnectionTimeout — set comfortably above the backend's normal response time. A too-tight timeout produces false positives during load spikes.

  • RetryCount — a higher retry count makes the checker more tolerant of transient failures but delays genuine-failure detection. Two or three is a common middle ground for noisy environments; one is fine for stable backends.

  • HttpCheckUrl — point at a dedicated health endpoint, not at a user-facing route. The endpoint should check the parts of the service that matter (database reachable, dependencies up) and return 200 only when real traffic can be served.
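The trade-offs above can be put into rough numbers. The sketch below estimates the worst-case time between a backend dying and being pulled from scheduling. It assumes a simplified timing model that this page does not confirm in detail: the failure happens right after a successful probe, every failing probe takes the full ConnectionTimeout, and retries start DelayRetry after the previous attempt ends. Treat the result as an order-of-magnitude guide, not a guarantee.

```python
def worst_case_detection_s(delay_loop: float, connection_timeout: float,
                           retry_count: int, delay_retry: float) -> float:
    """Rough upper bound on failure-detection time, in seconds.

    Assumed model: wait up to one DelayLoop for the next scheduled probe,
    then `retry_count` failing attempts that each run to ConnectionTimeout,
    with DelayRetry between consecutive attempts.
    """
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    return (delay_loop
            + retry_count * connection_timeout
            + (retry_count - 1) * delay_retry)


# Documented defaults: 10 + 1*3 + 0*3
print(worst_case_detection_s(10, 3, 1, 3))  # 13
# Same timings with RetryCount = 3: 10 + 3*3 + 2*3
print(worst_case_detection_s(10, 3, 3, 3))  # 25
```

Under this model, raising RetryCount from 1 to 3 roughly doubles the detection window, which is the cost of tolerating transient failures.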


State Transitions

A backend moves through four effective states:

| State | Receives new flows? | How to get here |
| --- | --- | --- |
| Healthy | Yes | Current probe succeeds, and prior probes have succeeded. |
| Probing / retrying | Yes (still considered healthy) | A probe fails, but RetryCount retries haven't run out yet. |
| Unhealthy | No | Failed probes and retries exhausted; the backend is removed from scheduling. |
| Recovering | No (still excluded) | Unhealthy backend where the next probe is pending. First success re-admits. |

Going unhealthy. A healthy backend enters probing on the first failure. It stays there while retries run (separated by DelayRetry). Once RetryCount consecutive failures accumulate, the backend is marked unhealthy and pulled from scheduling. In-flight connections continue to run but no new flows are scheduled.

Recovering. An unhealthy backend keeps being probed at the normal DelayLoop interval. One successful probe re-admits it — there is no "successes required" threshold on the recovery side. If you want stricter recovery, raise DelayLoop so that a single success represents a longer stability window.
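The transitions described above can be sketched as a small state machine. This is an illustrative model, not the checker's actual code: the class and method names are invented, "Recovering" is folded into UNHEALTHY because both exclude the backend and differ only in whether a probe is pending, and RetryCount is read as the total number of consecutive failures needed, per the parameter description.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    RETRYING = "probing/retrying"
    UNHEALTHY = "unhealthy"  # also covers "recovering" (next probe pending)


@dataclass
class BackendHealth:
    retry_count: int = 1            # consecutive failures needed to go unhealthy
    state: State = State.HEALTHY
    consecutive_failures: int = 0

    def receives_new_flows(self) -> bool:
        # A backend that is still retrying counts as healthy for scheduling.
        return self.state is not State.UNHEALTHY

    def record_probe(self, success: bool) -> State:
        if success:
            # One success is enough, including from the unhealthy state.
            self.consecutive_failures = 0
            self.state = State.HEALTHY
        else:
            self.consecutive_failures += 1
            if self.state is not State.UNHEALTHY:
                if self.consecutive_failures >= self.retry_count:
                    self.state = State.UNHEALTHY
                else:
                    self.state = State.RETRYING
        return self.state


backend = BackendHealth(retry_count=3)
for outcome in (False, False, False, True):
    print(outcome, "->", backend.record_probe(outcome).value)
```

With `retry_count=3`, the first two failures leave the backend schedulable, the third removes it, and the single success at the end re-admits it.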

All-Failed Fallback

When every backend in a listener fails the health check at the same time, there are two possible explanations: either every backend really is down (unlikely but possible), or the health check itself is broken (the probe target is wrong, an intermediate firewall rule changed, etc.). The listener has a configurable response:

| HealthTreatFailure value | Behavior |
| --- | --- |
| 0 (default, hard fail) | Backends stay marked unhealthy. No new flows are scheduled; the service returns errors to clients. |
| 1 (treat as healthy) | Every backend is treated as if healthy for scheduling purposes, even though probes are failing. Flows are distributed as normal while you investigate. |

Which to pick:

  • Hard fail is safer for most services — if every backend really is down, sending traffic to them does nothing useful.

  • Treat as healthy is useful for services where keeping the door open is better than going fully dark. If the probe is the thing that's broken (not the service), this keeps traffic flowing until you fix it.

This setting only kicks in when all backends are unhealthy. If any backend is healthy, traffic goes to the healthy ones, regardless of the fallback value.
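The selection rule can be summarized in a few lines of Python. This is a sketch of the decision logic as described here, with a hypothetical function name and a plain dict standing in for the checker's per-backend health results.

```python
def schedulable_backends(health: dict[str, bool],
                         health_treat_failure: int = 0) -> list[str]:
    """Return the backends eligible for new flows.

    `health` maps backend name -> latest health verdict (hypothetical shape).
    """
    healthy = [name for name, ok in health.items() if ok]
    if healthy:
        # The fallback is irrelevant as long as at least one backend is healthy.
        return healthy
    if health_treat_failure == 1:
        # All probes failing: schedule to every backend anyway.
        return list(health)
    # Hard fail: nothing is eligible, so new flows are rejected.
    return []


print(schedulable_backends({"vm-a": True, "vm-b": False}, 1))   # ['vm-a']
print(schedulable_backends({"vm-a": False, "vm-b": False}, 0))  # []
print(schedulable_backends({"vm-a": False, "vm-b": False}, 1))  # ['vm-a', 'vm-b']
```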

Isolation and Recovery Behavior

  • Unhealthy backends still hold their existing connections. The LB does not tear down active flows when a backend fails a probe — it stops scheduling new flows. Existing connections run until they close normally or idle out. If the backend is truly dead, those connections will fail at the application layer.

  • Recovered backends receive new traffic on the next scheduling decision. The scheduler re-evaluates on each new flow; once the backend is healthy, it starts being chosen again.

  • Weights and persistence interact with health. With session persistence on, a recovered backend is eligible for new sticky-able flows, but existing sticky entries pointing at a different backend don't move. With wlc (weighted least-connection) scheduling, a recovered backend that comes back at weight 3 starts with zero active flows and may briefly attract a disproportionate share until the counts re-level.
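To see why a recovered backend can briefly soak up traffic, consider the toy simulation below. It assumes the scheduler is a weighted least-connection one that picks the backend with the lowest active-flows-to-weight ratio; the function names, backend names, and flow counts are made up for illustration, and no flows close during the run.

```python
def pick_wlc(active: dict[str, int], weight: dict[str, int]) -> str:
    """Pick the backend with the lowest active/weight ratio.

    Assumes every weight is positive; ties go to the alphabetically first name.
    """
    return min(sorted(active), key=lambda name: active[name] / weight[name])


def simulate(active: dict[str, int], weight: dict[str, int],
             new_flows: int) -> list[str]:
    """Schedule `new_flows` flows and return the chosen backend for each."""
    counts = dict(active)  # work on a copy so the caller's dict is untouched
    picks = []
    for _ in range(new_flows):
        chosen = pick_wlc(counts, weight)
        counts[chosen] += 1
        picks.append(chosen)
    return picks


# "c" has just recovered: same weight as its peers, but zero active flows.
active = {"a": 30, "b": 30, "c": 0}
weight = {"a": 3, "b": 3, "c": 3}
print(simulate(active, weight, 10))  # all ten flows go to "c"
```

In this toy run the recovered backend takes every new flow until its count approaches that of its peers, so a backend that needs warm-up time may deserve a gentler reintroduction.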

Disabling Health Checks

You can disable health checks on a listener entirely (EnableHealthCheck = 0). When disabled, every backend is always considered eligible for scheduling regardless of reachability.

Disable only if something else is managing backend membership — for example, a control system that adds and removes backends based on its own liveness signal. Otherwise, a dead backend will keep receiving flows until you notice.
