Best Practices

Short, opinionated recommendations. They codify choices that work for most services; the rest of the guide explains the knobs in detail.

Instance Design

Start with one instance per service, not per port. A single instance can hold many listeners. Group everything a service exposes — web, API, admin, metrics — on one instance so it shares VIPs, whitelists, and security-group membership. Split only when lifecycles actually differ (different teams, different compliance scopes, different release cadences).

Keep IPv4 and IPv6 as parallel instances. Each instance is single-family. If you need dual-stack, create one instance per family with the same listeners and backends, and publish both addresses in DNS.

Listener Configuration

Default scheduler: mh (consistent hash). 5-tuple stability is usually what you want. Connections from the same client stick to the same backend without an explicit persistence timeout.
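To illustrate what hash-based stability means, here is a minimal sketch that maps a flow's 5-tuple to a backend deterministically. It uses rendezvous (highest-random-weight) hashing as a stand-in; this is not the scheduler's actual implementation, and the function name, addresses, and tuple layout are assumptions made for the example.

```python
import hashlib

def pick_backend(flow, backends):
    """Rendezvous hashing: score every backend against the flow key
    and return the backend with the highest score."""
    key = "|".join(str(part) for part in flow)

    def score(backend):
        digest = hashlib.sha256(f"{key}#{backend}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)

# Hypothetical 5-tuple: (protocol, src_ip, src_port, dst_ip, dst_port)
flow = ("tcp", "203.0.113.7", 51514, "198.51.100.10", 443)
backends = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

first = pick_backend(flow, backends)
second = pick_backend(flow, backends)  # same flow, same backend, no state kept
```

No table entry or timeout is involved: the mapping is a pure function of the flow and the pool, which is why changing pool membership is the event to watch.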

Switch to wlc when backends aren't identical. Mixed instance sizes, rolling upgrades to a newer VM family, or deliberately unequal pools: wlc lets weights do the work.
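As a rough sketch of how weights "do the work", assume the scheduler behaves like a classic weighted least-connections algorithm: each new flow goes to the backend with the lowest ratio of active connections to weight. The Backend class and pick_wlc function are hypothetical names for illustration, not part of any product API.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    weight: int
    active: int = 0  # current active connections

def pick_wlc(backends):
    """Pick the backend with the lowest active/weight ratio.
    Backends with weight 0 (draining) receive no new flows."""
    candidates = [b for b in backends if b.weight > 0]
    if not candidates:
        raise ValueError("no backend with weight > 0")
    return min(candidates, key=lambda b: b.active / b.weight)

pool = [Backend("small", weight=1), Backend("large", weight=3)]
counts = {"small": 0, "large": 0}
for _ in range(400):
    chosen = pick_wlc(pool)
    chosen.active += 1          # simulate a long-lived connection
    counts[chosen.name] += 1
```

With long-lived connections and these weights, the counts end up close to the 1:3 weight ratio, so the larger backend carries proportionally more flows.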

Enable session persistence only when necessary. Persistence adds state you have to reason about. If your service is stateless, skip it. If it holds per-session state that's expensive to move, pair persistence with mh or wlc, not wrr.

Set idle timeout to match your protocol's quiet periods. HTTP APIs: tens of seconds. Long-lived websockets or database pools: hundreds of seconds to minutes. If clients complain about unexpected disconnects during quiet periods, the idle timeout is the usual culprit.

Backend Configuration

Prefer backend port = 0 (same as listener port). It's the simplest mental model and what most services expect. Reach for a distinct backend port only when you have to — e.g., the service internally listens on 8443 while clients hit 443.
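The port rule above fits in one line. This sketch assumes that 0 is the only sentinel meaning "same as listener port"; the function name is hypothetical.

```python
def effective_backend_port(listener_port, backend_port=0):
    """Return the port traffic is forwarded to on the backend.
    A backend port of 0 means 'reuse the listener port'."""
    return listener_port if backend_port == 0 else backend_port

same_port = effective_backend_port(443)        # backend port left at 0
remapped = effective_backend_port(443, 8443)   # service listens on 8443 internally
```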

Drain before removing. Set weight to 0 first, wait for active connections to drain, then remove. Removing a backend with active connections still in the table causes mid-flow application errors.

Keep pool size below the cap. Large pools amplify health-check traffic and increase the reshuffle surface when you change membership.

One load balancer per region. Backends must share the load balancer's region. For a multi-region service, create one load balancer per region and steer clients at the DNS layer.

Health Checks

Use HTTP_GET against a dedicated /healthz endpoint for HTTP services. TCP_CHECK only verifies the port accepts connections. A proper /healthz verifies the service can actually serve requests — database reachable, caches warm, dependencies up.
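A minimal sketch of such an endpoint, using only the Python standard library. The dependency checks are placeholders you would replace with real probes, and the names evaluate, CHECKS, HealthHandler, and serve are assumptions for this example. It returns 200 only when every check passes and 503 otherwise.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def database_reachable():
    return True  # placeholder: replace with a cheap real probe

def cache_warm():
    return True  # placeholder

CHECKS = {"database": database_reachable, "cache": cache_warm}

def evaluate(checks):
    """Run every check; a check that raises counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, results = evaluate(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    """Blocks forever; call from your service's startup code."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keep each check cheap and bounded in time, so the health endpoint itself never becomes the slow route.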

Don't point health checks at user-facing routes. User routes pull data, query databases, and can have variable latency. A dedicated health endpoint should be cheap, deterministic, and only fail when the service is actually unable to serve.

Match ConnectionTimeout to real response time, with headroom. A 3-second timeout is fine for fast endpoints but triggers false positives for services that occasionally hit 2-second tails. Measure your p99 health-check response time and add a margin.
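One way to turn "measure p99 and add a margin" into a number is sketched below. It uses the nearest-rank percentile over response times you have collected yourself; the 1.5x margin and the sample data are assumptions chosen for illustration, not recommended values.

```python
import math

def suggest_timeout(samples_ms, margin=1.5):
    """Return (p99, suggested timeout) in milliseconds.
    p99 uses the nearest-rank method over the sorted samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(99 * len(ordered) / 100)
    p99 = ordered[rank - 1]
    return p99, p99 * margin

# Hypothetical measurements: mostly fast, with a slow tail.
samples = [40] * 95 + [300, 600, 900, 1800, 2100]
p99, timeout_ms = suggest_timeout(samples)
```

For this sample set the p99 is 1800 ms, so a 1.5x margin suggests 2700 ms: the tail fits, and a backend that is really hung is still detected quickly.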

Raise RetryCount in flaky environments. In environments with noisy networking or occasional GC pauses, RetryCount = 1 produces too many false-flap removals. Two or three retries smooth over transient blips without materially delaying real failure detection.
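The trade-off can be put in rough numbers. The sketch below assumes a backend is removed after RetryCount consecutive failed probes, that probe failures are independent (GC pauses can violate this), and that probes run at a fixed interval; the interval, timeout, and 2% failure rate are hypothetical inputs, not product defaults.

```python
def false_removal_probability(p_fail, retry_count):
    """Chance that a healthy backend fails retry_count probes in a row,
    assuming each probe fails independently with probability p_fail."""
    return p_fail ** retry_count

def worst_case_detection_seconds(retry_count, interval_s, timeout_s):
    """Upper bound on the time to notice a dead backend: every probe
    waits out the full timeout, then the next interval elapses."""
    return retry_count * (interval_s + timeout_s)

table = [
    (n, false_removal_probability(0.02, n), worst_case_detection_seconds(n, 5, 3))
    for n in (1, 2, 3)
]
```

Each extra retry multiplies the false-removal probability by the per-probe failure rate, while detection time grows only linearly.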

Leave HealthTreatFailure = 0 (hard fail) by default. Only switch to "treat all as healthy" for services where going dark is worse than serving potentially broken responses — and make sure you have alerting that fires on the condition, because the symptom won't be obvious from the service's perspective.

Operational Patterns

Rolling backend replacement.

  1. Add the new backend to the pool. Wait for it to go healthy.

  2. Set the old backend's weight to 0 (drain). Wait for active connections to drain.

  3. Remove the old backend.

Repeat per backend. No dropped connections, no client retries.
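The three steps above can be sketched as a small routine. FakePool is an in-memory stand-in written for this example, not a real client library; in practice each method would map to whatever API or CLI call your load balancer exposes for the same action.

```python
import time

class FakePool:
    """In-memory stand-in for a load balancer API (hypothetical)."""
    def __init__(self, active):
        self.weights = {name: 100 for name in active}
        self.active = dict(active)  # name -> active connection count

    def add(self, name, weight=100):
        self.weights[name] = weight
        self.active.setdefault(name, 0)

    def is_healthy(self, name):
        return name in self.weights  # fake: healthy as soon as added

    def set_weight(self, name, weight):
        self.weights[name] = weight

    def active_connections(self, name):
        # Fake drain: at weight 0, one connection finishes per poll.
        if self.weights[name] == 0 and self.active[name] > 0:
            self.active[name] -= 1
        return self.active[name]

    def remove(self, name):
        if self.active[name] > 0:
            raise RuntimeError(f"{name} still has active connections")
        del self.weights[name], self.active[name]

def wait_until(predicate, timeout_s=30.0, poll_s=0.0):
    """Poll until predicate() is true or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while not predicate():
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met before deadline")
        time.sleep(poll_s)

def replace_backend(pool, old, new):
    pool.add(new)                                           # step 1: add
    wait_until(lambda: pool.is_healthy(new))                #         and wait for healthy
    pool.set_weight(old, 0)                                 # step 2: drain
    wait_until(lambda: pool.active_connections(old) == 0)
    pool.remove(old)                                        # step 3: remove

pool = FakePool({"vm-old": 3})
replace_backend(pool, "vm-old", "vm-new")
```

The deadline in wait_until matters in real use: a connection that never closes should surface as a timeout you act on, not as a replacement that hangs indefinitely.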

Maintenance on a single backend.

  1. Drop weight to 0. Confirm active connections reach zero.

  2. Take the VM down, apply changes, bring it back.

  3. Restore weight. Health checks will readmit it after one success.

Changing scheduler. Switching between algorithms reshuffles flow-to-backend mapping (except for mh to mh-with-port, which is surgical). Plan scheduler changes for off-peak hours if your service is sensitive to mid-flow resets.

Session persistence + scheduler. If you enable persistence on a wrr or lc listener, new flows from a client within the persistence window go to the same backend. When the window expires, a fresh scheduling decision may send the next flow to a different backend. Pick a timeout that spans the expected "session" lifetime from your service's perspective, not an arbitrary round number.
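To make the window behavior concrete, here is a toy model of a persistence table in front of a plain round-robin scheduler (unweighted, for brevity). It assumes persistence is keyed by client IP and that each new flow refreshes the window; both are assumptions of the sketch, and PersistentListener is a hypothetical name.

```python
import itertools

class PersistentListener:
    """Toy persistence table in front of a round-robin scheduler."""
    def __init__(self, backends, timeout_s):
        self._rr = itertools.cycle(backends)
        self._timeout_s = timeout_s
        self._table = {}  # client_ip -> (backend, expires_at)

    def route(self, client_ip, now):
        entry = self._table.get(client_ip)
        if entry and now < entry[1]:
            backend = entry[0]        # inside the window: reuse the backend
        else:
            backend = next(self._rr)  # expired or unknown: fresh decision
        # Assumption: every new flow refreshes the window.
        self._table[client_ip] = (backend, now + self._timeout_s)
        return backend

lb = PersistentListener(["a", "b"], timeout_s=300)
first = lb.route("203.0.113.7", now=0)     # fresh decision
within = lb.route("203.0.113.7", now=200)  # inside the window
after = lb.route("203.0.113.7", now=900)   # window expired
```

Here the first two flows land on the same backend, while the flow at t=900 arrives after the window has lapsed and is scheduled again, landing on the other backend. A timeout shorter than the real session lifetime produces exactly this kind of mid-session move.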

Troubleshooting

Match a symptom to its most common causes. Color shows which stage of the path to check first.

Figure: Symptom-to-cause map across the LB path.
