# Health Check

The health checker continuously probes every backend in a listener's pool and updates each backend's health status. Unhealthy backends are pulled out of scheduling; recovered backends come back in. This is what turns a static list of VMs into a self-healing pool.

Health checks run independently per listener. A VM that appears in two listeners is probed twice — once for each listener's settings — and can be healthy in one while unhealthy in the other, if their probes target different ports.

![Health check state machine](/files/PAtejPhVRL5tANJpsb2F)

## Probe Types

Four probe types are available, configured per listener:

| Type         | What it does                                                                                                                                                                                                          |
| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `TCP_CHECK`  | Opens a TCP connection to the backend's check port. Success = completed handshake. The probe closes the connection immediately after.                                                                                 |
| `HTTP_GET`   | Issues an HTTP `GET` to a configured path and compares the response's status code to an expected value (e.g., `200`). Success = matching status. Works for both HTTP and HTTPS — the `http_protocol` is configurable. |
| `MISC_CHECK` | Runs a platform-provided custom probe. Used for services where TCP and HTTP probes aren't expressive enough. Contact [support](mailto:support@zenlayer.com) to configure.                                             |
| `PING_CHECK` | Sends an ICMP echo to the backend. Rarely useful as a *service* health check: a VM can answer pings while its service is broken. Prefer `TCP_CHECK` or `HTTP_GET` where possible.                                    |

### Which probe type to use

* **`TCP_CHECK`** is the default choice for arbitrary TCP services. It verifies the backend accepts connections. It does *not* verify the service is actually serving requests — a stuck backend may still accept connections without processing them.
* **`HTTP_GET`** is the right choice for HTTP or HTTPS services. Hit a dedicated `/healthz` endpoint that returns 200 only when the service can serve real traffic. This catches stuck-service cases that `TCP_CHECK` misses.
* **`MISC_CHECK`** is an escape hatch for custom logic — a script can do anything, from protocol-specific handshakes to querying a downstream dependency.
* **`PING_CHECK`** should only be used when no other option works. It checks host liveness, not service liveness.
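
To make the difference between the two most common probe types concrete, here is a minimal Python sketch of what a `TCP_CHECK` and an `HTTP_GET` probe verify. It is an illustration, not the platform's implementation: the function names, the `use_https` flag (standing in for `http_protocol`), and the default values are assumptions chosen to mirror the parameters described on this page.

```python
import http.client
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Succeed if a TCP handshake completes within the timeout."""
    try:
        # create_connection performs the handshake; leaving the
        # with-block closes the connection immediately afterwards.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def http_get_check(host: str, port: int, path: str = "/healthz",
                   expected_status: int = 200, timeout: float = 3.0,
                   use_https: bool = False) -> bool:
    """Succeed if GET <path> returns exactly the expected status code."""
    conn_cls = (http.client.HTTPSConnection if use_https
                else http.client.HTTPConnection)
    conn = conn_cls(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == expected_status
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

A backend whose process accepts connections but answers `/healthz` with `503` passes `tcp_check` and fails `http_get_check`, which is the stuck-service case described above.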

## Probe Parameters

Each listener's health check is configured with these parameters:

| Parameter             | Default | Description                                                                                                                                   |
| --------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| **EnableHealthCheck** | on      | Whether the health checker runs at all for this listener.                                                                                     |
| **CheckPort**         | `0`     | Port to probe. `0` means "same as backend port." Use a non-zero value to probe a health endpoint on a different port than the service itself. |
| **DelayLoop**         | `10 s`  | Interval between probes.                                                                                                                      |
| **ConnectionTimeout** | `3 s`   | How long to wait for the probe to succeed before counting it as a failure.                                                                    |
| **RetryCount**        | `1`     | Consecutive failures needed to mark a healthy backend unhealthy.                                                                              |
| **DelayRetry**        | `3 s`   | Interval between retry attempts after an initial failure.                                                                                     |
| **HttpCheckUrl**      | —       | For `HTTP_GET`: the path to request (e.g., `/healthz`).                                                                                       |
| **HttpStatusCode**    | `200`   | For `HTTP_GET`: expected response status.                                                                                                     |
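
As a quick reference, the table above can be mirrored in a small Python structure. This is only a sketch for reasoning about the settings: the snake_case field names and the `probe_port` helper are hypothetical and are not the platform's API; the default values are the ones listed in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthCheckConfig:
    """Hypothetical mirror of the per-listener health check parameters."""
    enable_health_check: bool = True        # EnableHealthCheck: on
    check_port: int = 0                     # CheckPort: 0 = same as backend port
    delay_loop_s: float = 10.0              # DelayLoop
    connection_timeout_s: float = 3.0       # ConnectionTimeout
    retry_count: int = 1                    # RetryCount
    delay_retry_s: float = 3.0              # DelayRetry
    http_check_url: Optional[str] = None    # HttpCheckUrl (HTTP_GET only)
    http_status_code: int = 200             # HttpStatusCode (HTTP_GET only)

    def probe_port(self, backend_port: int) -> int:
        """Resolve the port that is actually probed for a backend."""
        return self.check_port if self.check_port != 0 else backend_port
```

For example, `HealthCheckConfig().probe_port(8080)` resolves to the backend port, while `HealthCheckConfig(check_port=9000).probe_port(8080)` resolves to the dedicated health port.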

### Picking parameters

* **DelayLoop**: more frequent probes detect failures faster but generate more probe traffic. For services with a long warm-up, a shorter loop gets newly added backends serving traffic sooner, because a backend becomes eligible only after a successful probe. `10 s` is reasonable for most services.
* **ConnectionTimeout**: set it comfortably above the backend's normal response time. A timeout that is too tight produces false failures during load spikes, marking slow but working backends unhealthy.
* **RetryCount**: a higher retry count makes the checker more tolerant of transient failures but delays detection of genuine failures. Two or three is a common middle ground for noisy environments; one is fine for stable backends.
* **HttpCheckUrl**: point at a dedicated health endpoint, not at a user-facing route. The endpoint should check the parts of the service that matter (database reachable, dependencies up) and return 200 only when real traffic can be served.
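
The dedicated health endpoint recommended above could look roughly like the following on the backend side. This is a minimal sketch using only the Python standard library; `database_reachable`, the `CHECKS` registry, and `make_health_server` are hypothetical names, and a real service would replace the placeholder check with its own dependency checks.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict


def database_reachable() -> bool:
    """Hypothetical dependency check; replace with a real query or ping."""
    return True


# Every check must pass before the backend reports itself ready.
CHECKS: Dict[str, Callable[[], bool]] = {"database": database_reachable}


def ready() -> bool:
    return all(check() for check in CHECKS.values())


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        ok = ready()
        body = b"ok\n" if ok else b"unavailable\n"
        # 200 only when real traffic can be served; 503 otherwise.
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep periodic probe requests out of the access log


def make_health_server(host: str, port: int) -> HTTPServer:
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), HealthHandler)
```

Running this on a separate port from the main service pairs naturally with a non-zero **CheckPort**, with **HttpCheckUrl** set to `/healthz` and **HttpStatusCode** left at `200`.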

***

## State Transitions

A backend moves through four effective states:

| State              | Receives new flows?            | How to get here                                                             |
| ------------------ | ------------------------------ | --------------------------------------------------------------------------- |
| Healthy            | Yes                            | Current probe succeeds, and prior probes have succeeded.                    |
| Probing / retrying | Yes (still considered healthy) | A probe fails, but `RetryCount` retries haven't run out yet.                |
| Unhealthy          | No                             | Failed probes + retries exhausted — the backend is removed from scheduling. |
| Recovering         | No (still excluded)            | Unhealthy backend where the next probe is pending. First success re-admits. |

**Going unhealthy.** A healthy backend enters *probing* on the first failure. It stays there while retries run (separated by `DelayRetry`). Once `RetryCount` consecutive failures accumulate, the backend is marked *unhealthy* and pulled from scheduling. In-flight connections continue to run but no new flows are scheduled.

**Recovering.** An unhealthy backend keeps being probed at the normal `DelayLoop` interval. One successful probe re-admits it — there is no "successes required" threshold on the recovery side. If you want stricter recovery, raise `DelayLoop` so that a single success represents a longer stability window.
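
The transitions described above can be sketched as a small state machine. This is a simplified model under stated assumptions, not the checker's actual code: `RetryCount` is interpreted as the number of consecutive failed probes that marks a healthy backend unhealthy, the "recovering" state is folded into "unhealthy", and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Simplified per-backend health state (hypothetical model)."""
    retry_count: int = 1
    delay_loop_s: float = 10.0
    delay_retry_s: float = 3.0
    healthy: bool = True      # True means the backend receives new flows
    failures: int = 0         # consecutive failures while still healthy

    @property
    def state(self) -> str:
        if not self.healthy:
            return "unhealthy"
        return "probing" if self.failures else "healthy"

    def record(self, probe_ok: bool) -> float:
        """Apply one probe result; return seconds until the next probe."""
        if probe_ok:
            # A single success re-admits the backend and clears the count.
            self.healthy = True
            self.failures = 0
            return self.delay_loop_s
        if self.healthy:
            self.failures += 1
            if self.failures >= self.retry_count:
                # Retries exhausted: remove from scheduling.
                self.healthy = False
                self.failures = 0
                return self.delay_loop_s
            # Still considered healthy; retry after DelayRetry.
            return self.delay_retry_s
        # Already unhealthy: keep probing at the normal interval.
        return self.delay_loop_s
```

With `retry_count=3`, two failed probes leave the backend in `probing` (still scheduled), the third moves it to `unhealthy`, and the next successful probe returns it to `healthy`.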

## All-Failed Fallback

When every backend in a listener fails the health check at the same time, there are two possible explanations: every backend really is down (unlikely but possible), or the health check itself is broken (the probe target is wrong, an intermediate firewall rule changed, and so on). The listener's response to this situation is configurable:

| `HealthTreatFailure` value | Behavior                                                                                                                                               |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `0` (default — hard fail)  | Backends stay marked unhealthy. No new flows are scheduled; the service returns errors to clients.                                                     |
| `1` (treat as healthy)     | Every backend is treated as if healthy for scheduling purposes, even though probes are failing. Flows are distributed as normal while you investigate. |

**Which to pick:**

* **Hard fail** is safer for most services — if every backend really is down, sending traffic to them does nothing useful.
* **Treat as healthy** is useful for services where keeping the door open is better than going fully dark. If the probe is the thing that's broken (not the service), this keeps traffic flowing until you fix it.

This setting only kicks in when *all* backends are unhealthy. If any backend is healthy, traffic goes to the healthy ones, regardless of the fallback value.
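
The fallback rule can be summarized in a few lines of Python. This is a sketch of the decision logic as described in this section, with a hypothetical function name and a plain dictionary standing in for the listener's health table.

```python
from typing import Dict, List


def schedulable_backends(health: Dict[str, bool],
                         health_treat_failure: int = 0) -> List[str]:
    """Return the backends eligible for new flows.

    `health` maps a backend name to its current probe-derived status.
    """
    healthy = [name for name, ok in health.items() if ok]
    if healthy:
        # At least one healthy backend: the fallback value is irrelevant.
        return healthy
    if health_treat_failure == 1:
        # All failed, treat as healthy: schedule across every backend.
        return list(health)
    # All failed, hard fail: nothing is eligible.
    return []
```

For example, `schedulable_backends({"a": False, "b": False}, 1)` returns both backends, while the same call with `0` returns an empty list.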

## Isolation and Recovery Behavior

* **Unhealthy backends still hold their existing connections.** The LB does not tear down active flows when a backend fails a probe — it stops scheduling *new* flows. Existing connections run until they close normally or idle out. If the backend is truly dead, those connections will fail at the application layer.
* **Recovered backends receive new traffic on the next scheduling decision.** The scheduler re-evaluates on each new flow; once the backend is healthy, it starts being chosen again.
* **Weights and persistence interact with health.** With session persistence on, a recovered backend is eligible for new *sticky-able* flows, but existing sticky entries pointing at a different backend don't move. With `wlc`, a recovered backend that comes back at weight 3 starts with zero active flows and may briefly attract a disproportionate share of new flows until the counts re-level.
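
The re-leveling effect under weighted least-connection scheduling can be illustrated with a toy simulation. It assumes the common textbook definition (pick the backend with the lowest ratio of active flows to weight) and that no existing flows close during the burst; the platform's exact formula is not documented here, and all names and numbers below are hypothetical.

```python
from typing import Dict, List


def pick_wlc(active: Dict[str, int], weights: Dict[str, int]) -> str:
    """Pick the backend with the lowest active-flows-to-weight ratio."""
    return min(active, key=lambda name: active[name] / weights[name])


def simulate_new_flows(active: Dict[str, int], weights: Dict[str, int],
                       new_flows: int) -> List[str]:
    """Schedule `new_flows` flows and return the chosen backend for each."""
    active = dict(active)  # work on a copy; do not mutate the caller's state
    chosen = []
    for _ in range(new_flows):
        name = pick_wlc(active, weights)
        active[name] += 1
        chosen.append(name)
    return chosen


# Backend "c" has just recovered: same weight as its peers, no active flows.
weights = {"a": 3, "b": 3, "c": 3}
active = {"a": 30, "b": 30, "c": 0}
picks = simulate_new_flows(active, weights, 30)
```

Under these assumptions every one of the 30 new flows lands on `c`, because its ratio stays below the others until it has caught up. Make sure a freshly recovered backend can absorb such a burst.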

## Disabling Health Checks

You can disable health checks on a listener entirely (`EnableHealthCheck = 0`). When disabled, every backend is always considered eligible for scheduling regardless of reachability.

Disable only if something else is managing backend membership — for example, a control system that adds and removes backends based on its own liveness signal. Otherwise, a dead backend will keep receiving flows until you notice.

