Confusions about production ready zero downtime deployment

1. The problem I’m having:

I’m using Caddy as a load balancer in front of FrankenPHP (Caddy + Laravel Octane) on a single machine to support zero-downtime deployments via Docker Compose.

There are some unclear things related to caddy config reloads that only seem to work "by accident".

After trying many different approaches, I found a blue-green deployment strategy that appears to work (tested with many thousands of concurrent requests during deployment).

TL;DR (full script at the bottom of the post)

  1. Bring up a new instance (green) with the new image.
  2. Once healthy, use the admin API to update the reverse_proxy upstreams to both blue (current image) and green (see the curl sketch after this list).
  3. Send SIGTERM to blue, which causes the upstream caddy/frankenphp to gracefully exit and reject new requests. Traffic now only goes to green via lb_policy round_robin.
  4. Bring up blue with the new image. Traffic now goes to two identical upstreams.
  5. Use the admin API to update the reverse_proxy upstreams to only the blue instance (now running the new image).
  6. Send SIGTERM to green (as in step 3).
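
A minimal sketch of the admin API calls used in steps 2 and 5 (the endpoint path and container names are taken from the full script at the bottom; "srv0" is the name Caddy assigns to the first server adapted from the Caddyfile):

# Sketch only: switch the reverse_proxy upstreams via Caddy's admin API.
UPSTREAMS="http://localhost:2019/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/handle/0/upstreams"

# Step 2: route to both green (new image) and blue (current image).
curl -s --fail-with-body -X PATCH -H "Content-Type: application/json" \
  -d '[{"dial":"app-server-green:80"},{"dial":"app-server-blue:80"}]' \
  "$UPSTREAMS"

# Step 5: route only to blue (now running the new image).
curl -s --fail-with-body -X PATCH -H "Content-Type: application/json" \
  -d '[{"dial":"app-server-blue:80"}]' \
  "$UPSTREAMS"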

This works, but I don’t understand why steps 5 and 6 must happen in that order.

I banged my head against the wall trying the more intuitive approach of first gracefully shutting down the green instance, then removing it from the reverse_proxy config.

Problem a) That leads to a lot of dropped requests:

Status code distribution:
  [200] 280614 responses
  [502] 113 responses
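
(For context, the distribution above comes from hey, run against the load-test endpoint while the deployment script executes. Roughly, with illustrative flags only; the exact values are not the point:)

# Illustrative hey invocation; the actual concurrency/duration values differed.
hey -z 2m -c 500 https://my.snicco.local/test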

Example error from the reverse proxy when steps 5 and 6 are swapped (graceful shutdown first, then update upstreams):

/test is the endpoint of the load test.

"bytes_read": 0, is interesting.

On the other hand, first removing the upstream from the load balancer, and then gracefully shutting down the container works flawlessly every time.

Problem b) Under high load, during config reloading, caddy prints a lot of odd error messages that I can’t decipher.

2. Error messages and/or full log output:

Errors for problem a)

{
  "level": "error",
  "ts": 1732650117.984192,
  "logger": "http.log.access.log0",
  "msg": "handled request",
  "request": {
    "remote_ip": "172.21.0.1",
    "remote_port": "60938",
    "client_ip": "172.21.0.1",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "my.snicco.local",
    "uri": "/test",
    "headers": {
      "User-Agent": [
        "hey/0.0.1"
      ],
      "Content-Type": [
        "text/html"
      ],
      "Accept-Encoding": [
        "gzip"
      ]
    },
    "tls": {
      "resumed": false,
      "version": 772,
      "cipher_suite": 4865,
      "proto": "",
      "server_name": "my.snicco.local"
    }
  },
  "bytes_read": 0,
  "user_id": "",
  "duration": 0.004045749,
  "size": 65,
  "status": 502,
  "resp_headers": {
    "Server": [
      "Caddy"
    ],
    "Content-Type": [
      "text/plain; charset=utf-8"
    ]
  }
}

Errors for problem b) (weird errors)

{
  "level": "error",
  "ts": 1732651645.6192622,
  "logger": "http.log",
  "msg": "setting HTTP/3 Alt-Svc header",
  "error": "no port can be announced, specify it explicitly using Server.Port or Server.Addr"
}
msg=HTTP/2 skipped because it requires TLS

3. Caddy version:

/srv # caddy version
v2.8.4 h1:q3pe0wpBj1OcHFZ3n/1nl4V4bxBrYoSoab7rL9BMYNk=

I also tried with the latest beta (same issue).

/srv # caddy version
v2.9.0-beta.3 h1:tlqfbJMRNY6vnWwaQrnWrgS+wkDXr9GIFUD/P+HY9vA=

4. How I installed and ran Caddy:

a. System environment:

Linux alkan-122334539-dev 
6.8.0-49-generic #49~22.04.1-Ubuntu SMP 
PREEMPT_DYNAMIC 
Wed Nov  6 17:42:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Docker version 27.3.1, build ce12230
Docker Compose version v2.29.7

b. Command:

Default caddy entry point of the official docker image.

FROM caddy:2.8.4-alpine

RUN apk --no-cache add curl

# Create a caddy user/group
RUN <<SHELL
set -euo pipefail

addgroup -S caddy && adduser -S -G caddy caddy &&
mkdir -p /data/caddy
mkdir -p /config/caddy
chown -R caddy:caddy /data/caddy
chown -R caddy:caddy /config/caddy
SHELL

COPY caddy-proxy.Caddyfile /etc/caddy/Caddyfile

USER caddy

c. Service/unit/compose file:

# Relevant parts from docker-compose.yaml
services:
  caddy-proxy:
    container_name: caddy-proxy
    image: ghcr.io/snicco/my_snicco/caddy-proxy:${GIT_SHA:-latest}
    restart: always
    ports:
      - "80:80" # HTTP => redirect to HTTPS
      - "443:443" # HTTPS
      - "443:443/udp" # HTTP/3
    volumes:
      - caddy_proxy_data:/data/caddy
      - caddy_proxy_config:/config/caddy
    build:
      context: ../../infrastructure/docker/caddy-proxy
      dockerfile: caddy-proxy.dockerfile
    environment:
      SERVER_NAME: https://${LARAVEL_APP_HOST:?}
      CADDY_PROXY_GLOBAL_OPTIONS: ${CADDY_PROXY_GLOBAL_OPTIONS:-}
      CADDY_PROXY_GLOBAL_LOG_LEVEL: ${CADDY_PROXY_GLOBAL_LOG_LEVEL:-warn}
      CADDY_PROXY_SERVER_LOG_LEVEL: ${CADDY_PROXY_SERVER_LOG_LEVEL:-warn}

  app-server-blue: &app-server
    # a custom dockerfile based on frankenphp (laravel-octane+caddy)

  app-server-green:
    <<: *app-server

d. My complete Caddy config:

{
        {$CADDY_PROXY_GLOBAL_OPTIONS}

        # No reason to add certificates in local dev.
        # Caddy only runs in docker.
        skip_install_trust

        # Set the global log level for all logs
        log {
                level {$CADDY_PROXY_GLOBAL_LOG_LEVEL}
                output stderr
        }
}

{$SERVER_NAME} {
        reverse_proxy app-server-blue:80 {
                # Allow Caddy to "queue" a request for up to 15 seconds,
                # retrying every 250ms.
                lb_try_duration 15s
                lb_try_interval 250ms
                # Rotate requests across the configured upstreams
                lb_policy round_robin
        }

        handle_errors {
                @timeout {
                        expression {http.error.status_code} == 502
                }
                respond @timeout "The service is unavailable, try again later - upstream unavailable" 502
        }

        log {
                # Configurable log level (e.g., DEBUG, INFO, WARN, ERROR)
                level {$CADDY_PROXY_SERVER_LOG_LEVEL}

                # access logs to stdout, server logs to stderr.
                output stdout
        }
}

5. Links to relevant resources:

My deployment script:

#!/usr/bin/env bash

set -euo pipefail

CADDY="caddy-proxy"
BLUE="app-server-blue"
GREEN="app-server-green"
LARAVEL_HEALTHCHECK_ROUTE="health"
CADDY_ADMIN_ENDPOINT="http://localhost:2019"
CADDY_PATCH_UPSTREAM_ENDPOINT="$CADDY_ADMIN_ENDPOINT/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/handle/0/upstreams"
START_UTC=$(date +%s)

heading() {
  local YELLOW
  local NC
  YELLOW='\033[1;33m'
  NC='\033[0m'
  {
    echo
    echo -e "$YELLOW================================================================================$NC"
    echo -e "$YELLOW $1 $NC"
    echo -e "$YELLOW================================================================================$NC"
    echo
  }
}

prepare() {
  # Start services on first deployment, if not already running.
  bash dc.sh pull
  bash dc.sh up "$CADDY" "$BLUE" --detach --no-recreate
}

acquire_lock() {
  LOCKFILE="/tmp/zero-downtime-deploy.lock"
  exec 200>"$LOCKFILE"
  flock -n 200 || {
      echo "ERROR: Another instance of the deployment script is running."
      exit 1
  }
  echo "Acquired lock on: $LOCKFILE"
  trap 'rm -f "$LOCKFILE"' EXIT
  echo "Set trap to remove lock file on exit."
}

route_new_caddy_lb_requests_to() {
  local targets=$1
  local upstreams=""

  case $targets in
  "$BLUE")
    upstreams="[{\"dial\":\"${BLUE}:80\"}]"
    ;;
  "$GREEN")
    upstreams="[{\"dial\":\"${GREEN}:80\"}]"
    ;;
  "both")
    upstreams="[{\"dial\":\"${GREEN}:80\"},{\"dial\":\"${BLUE}:80\"}]"
    ;;
  *)
    echo "Invalid target: $targets"
    exit 1
    ;;
  esac

  echo "Routing traffic to $targets..."
  bash dc.sh exec "$CADDY" curl -s --fail-with-body -H "Content-Type: application/json" -d "$upstreams" -X PATCH "$CADDY_PATCH_UPSTREAM_ENDPOINT"
  echo "Traffic routed successfully."
}

wait_until_frankenphp_healthy() {
  local container=$1
  local i
  for i in {1..30}; do
    echo "Waiting for $container to become healthy... [$i/30]"
    # Note: Don't put the retry logic inside curl, because docker exec could also fail,
    # and would not be retried.
    # Using curl from within the caddy container is a good way to test that docker DNS is working
    # as expected.
    if bash dc.sh exec "$CADDY" curl -s --fail-with-body "http://$container:80/$LARAVEL_HEALTHCHECK_ROUTE"; then
      return 0
    fi
    sleep 1
  done
  echo "ERROR: $container did not become healthy after 30 tries."
  exit 1
}

start_frankenphp() {
  local container=$1
  bash dc.sh up \
    --detach \
    --no-deps \
    "$container"
}

stop_frankenphp_gracefully() {
  local container_name=$1
  # Octane has a timeout of 30 seconds.
  # This sends a SIGTERM, and after 60 seconds a SIGKILL.
  bash dc.sh stop "$container_name" --timeout=60
  bash dc.sh rm "$container_name" --force
}

heading "Acquiring deployment lock."
acquire_lock

# Step 1: Bring up initial state.
# Caddy is always expected to start
# with app-server-blue, and route traffic to it.
heading "Restoring/Creating initial services."
prepare
# Just in case there was a partial failure somewhere, and green is actually running,
# wait until blue is healthy before switching to blue only.
wait_until_frankenphp_healthy "$BLUE"
route_new_caddy_lb_requests_to "$BLUE"
# If there is still a green instance running for whatever reason from a previous deployment,
# stop it gracefully.
stop_frankenphp_gracefully "$GREEN"

# Step 2: Bring up green instance with the new image.
# The container does not yet receive requests.
heading "Bringing up green instance with the new image."
start_frankenphp "$GREEN"
# Once healthy, it's safe to route traffic to the green instance
# in addition to the blue instance.
wait_until_frankenphp_healthy "$GREEN"
# Now, both green (new code) and blue (old code) are running
# and will get requests from the caddy load balancer.
route_new_caddy_lb_requests_to "both"

# Step 3: Bring down blue instance (old code)
# Sending a SIGTERM to frankenphp will perform "request draining".
# Frankenphp will process all existing requests that have arrived,
# but will reject any new requests.
# The caddy LB will re-route all rejected requests to the
# green instance with the new code.
heading "Gracefully stopping blue instance (old code)"
stop_frankenphp_gracefully "$BLUE"

# Step 4: Restart blue instance with the new code.
heading "Restarting blue instance with the new code."
start_frankenphp "$BLUE"
wait_until_frankenphp_healthy "$BLUE"

# Step 5: Restore original routing configuration.
# Right now, we have two frankenphp instances running,
# both using 2X the number of CPUs (4x total).
# Stop giving traffic to the green instance.
heading "Re-routing traffic to blue instance."
route_new_caddy_lb_requests_to "$BLUE"

# Step 6: Bring down green instance to free up resources.
# Again, sending a SIGTERM to frankenphp will perform "request draining".
# But it's VERY IMPORTANT that we shut down
# the green instance AFTER the traffic switch to blue.
# Previously, we tried to shut it down before the traffic switch,
# which, in theory, should also work, but in practice, it did not
# and caused some requests to be dropped under high-load with a 502 error.
# My best guess is that caddy runs "two internal servers" on traffic switching,
# and the old one can't send requests to the green instance because
# it does not accept connections.
# But also, Caddy won't bounce them to the blue instance either (reason unknown).
heading "Gracefully stopping green instance."
stop_frankenphp_gracefully "$GREEN"

END_UTC=$(date +%s)
heading "[OK] Zero downtime deployment complete. Took $((END_UTC - START_UTC)) seconds."

Reproducer


I was able to reproduce the same issue by replacing frankenphp with just another caddy image.

Deployment script (steps run in the intuitive order, which does not work: it produces 502s)

#!/usr/bin/env bash

set -euo pipefail

CADDY="caddy-proxy"
BLUE="app-server-blue"
GREEN="app-server-green"
LARAVEL_HEALTHCHECK_ROUTE="/"
CADDY_ADMIN_ENDPOINT="http://localhost:2019"
CADDY_PATCH_UPSTREAM_ENDPOINT="$CADDY_ADMIN_ENDPOINT/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/handle/0/upstreams"
START_UTC=$(date +%s)

heading() {
  local YELLOW
  local NC
  YELLOW='\033[1;33m'
  NC='\033[0m'
  {
    echo
    echo -e "$YELLOW================================================================================$NC"
    echo -e "$YELLOW $1 $NC"
    echo -e "$YELLOW================================================================================$NC"
    echo
  }
}

prepare() {
  # Start services on first deployment, if not already running.
  bash dc.sh up "$CADDY" "$BLUE" --detach --no-recreate
}

acquire_lock() {
  LOCKFILE="/tmp/zero-downtime-deploy.lock"
  exec 200>"$LOCKFILE"
  flock -n 200 || {
      echo "ERROR: Another instance of the deployment script is running."
      exit 1
  }
  echo "Acquired lock on: $LOCKFILE"
  trap 'rm -f "$LOCKFILE"' EXIT
  echo "Set trap to remove lock file on exit."
}

route_new_caddy_lb_requests_to() {
  local targets=$1
  local upstreams=""

  case $targets in
  "$BLUE")
    upstreams="[{\"dial\":\"${BLUE}:80\"}]"
    ;;
  "$GREEN")
    upstreams="[{\"dial\":\"${GREEN}:80\"}]"
    ;;
  "both")
    upstreams="[{\"dial\":\"${GREEN}:80\"},{\"dial\":\"${BLUE}:80\"}]"
    ;;
  *)
    echo "Invalid target: $targets"
    exit 1
    ;;
  esac

  echo "Routing traffic to $targets..."
  bash dc.sh exec "$CADDY" curl -s --fail-with-body -H "Content-Type: application/json" -d "$upstreams" -X PATCH "$CADDY_PATCH_UPSTREAM_ENDPOINT"
  echo "Traffic routed successfully."
}

wait_until_frankenphp_healthy() {
  local container=$1
  local i
  for i in {1..30}; do
    echo "Waiting for $container to become healthy... [$i/30]"
    # Note: Don't put the retry logic inside curl, because docker exec could also fail,
    # and would not be retried.
    # Using curl from within the caddy container is a good way to test that docker DNS is working
    # as expected.
    if bash dc.sh exec "$CADDY" curl -s --fail-with-body "http://$container:80$LARAVEL_HEALTHCHECK_ROUTE"; then
      return 0
    fi
    sleep 1
  done
  echo "ERROR: $container did not become healthy after 30 tries."
  exit 1
}

start_frankenphp() {
  local container=$1
  bash dc.sh up \
    --detach \
    --no-deps \
    "$container"
}

stop_frankenphp_gracefully() {
  local container_name=$1
  # Octane has a timeout of 30 seconds.
  # This sends a SIGTERM, and after 60 seconds a SIGKILL.
  bash dc.sh stop "$container_name" --timeout=60
  bash dc.sh rm "$container_name" --force
}

heading "Acquiring deployment lock."
acquire_lock

# Step 1: Bring up initial state.
# Caddy is always expected to start
# with app-server-blue, and route traffic to it.
heading "Restoring/Creating initial services."
prepare
# Just in case there was a partial failure somewhere, and green is actually running,
# wait until blue is healthy before switching to blue only.
wait_until_frankenphp_healthy "$BLUE"
route_new_caddy_lb_requests_to "$BLUE"
# If there is still a green instance running for whatever reason from a previous deployment,
# stop it gracefully.
stop_frankenphp_gracefully "$GREEN"

# Step 2: Bring up green instance with the new image.
# The container does not yet receive requests.
heading "Bringing up green instance with the new image."
start_frankenphp "$GREEN"
# Once healthy, it's safe to route traffic to the green instance
# in addition to the blue instance.
wait_until_frankenphp_healthy "$GREEN"
# Now, both green (new code) and blue (old code) are running
# and will get requests from the caddy load balancer.
route_new_caddy_lb_requests_to "both"

# Step 3: Bring down blue instance (old code)
# Sending a SIGTERM to frankenphp will perform "request draining".
# Frankenphp will process all existing requests that have arrived,
# but will reject any new requests.
# The caddy LB will re-route all rejected requests to the
# green instance with the new code.
heading "Gracefully stopping blue instance (old code)"
stop_frankenphp_gracefully "$BLUE"

# Step 4: Restart blue instance with the new code.
heading "Restarting blue instance with the new code."
start_frankenphp "$BLUE"
wait_until_frankenphp_healthy "$BLUE"

# Step 5: Restore original routing configuration.
# Right now, we have two frankenphp instances running,
# both using 2X the number of CPUs (4x total).
# Stop giving traffic to the green instance.
# heading "Re-routing traffic to blue instance."
#route_new_caddy_lb_requests_to "$BLUE"

# Step 6: Bring down green instance to free up resources.
# Again, sending a SIGTERM to frankenphp will perform "request draining".
# NOTE: this reproducer intentionally uses the "intuitive" (broken) order:
# green is gracefully stopped FIRST and only removed from the
# load balancer config afterwards (see the route call below).
# In theory this should also work, but in practice it drops some
# requests under high load with a 502 error.
heading "Gracefully stopping green instance."
stop_frankenphp_gracefully "$GREEN"

route_new_caddy_lb_requests_to "$BLUE"

END_UTC=$(date +%s)
heading "[OK] Zero downtime deployment complete. Took $((END_UTC - START_UTC)) seconds."

Docker compose

volumes:
  caddy_proxy_data:
  caddy_proxy_config:

services:
  caddy-proxy:
    container_name: caddy-proxy
    image: ghcr.io/reproducer/my_reproducer/caddy-proxy:${GIT_SHA:-latest}
    restart: always
    ports:
      - "80:80" # HTTP => redirect to HTTPS
      - "443:443" # HTTPS
      - "443:443/udp" # HTTP/3
    volumes:
      - caddy_proxy_data:/data/caddy
      - caddy_proxy_config:/config/caddy
    build:
      context: .
      dockerfile: caddy-proxy.dockerfile
    environment:
      SERVER_NAME: https://${LARAVEL_APP_HOST:?}
      CADDY_PROXY_GLOBAL_OPTIONS: ${CADDY_PROXY_GLOBAL_OPTIONS:-}
      CADDY_PROXY_GLOBAL_LOG_LEVEL: ${CADDY_PROXY_GLOBAL_LOG_LEVEL:-warn}
      CADDY_PROXY_SERVER_LOG_LEVEL: ${CADDY_PROXY_SERVER_LOG_LEVEL:-warn}

  app-server-blue: &app-server
    container_name: app-server-blue
    image: caddy:2.8.4-alpine

  app-server-green:
    <<: *app-server
    container_name: app-server-green    

Have you tried setting lb_retries to a positive integer? Without it, when Caddy hits a bad upstream it won’t hold the request to try an alternative upstream. It kinda sounds from your description like you want to be able to kill the upstream BEFORE removing it from Caddy’s config, and have Caddy adapt automatically, and I think this would be required for that.

  • lb_retries is how many times to retry selecting available backends for each request if the next available host is down. By default, retries are disabled (zero).
    If lb_try_duration is also configured, then retries may stop early if the duration is reached. In other words, the retry duration takes precedence over the retry count.

https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#lb_retries

You might also want to set a conservative fail_duration as well so that Caddy avoids the bad backend for a short period after hitting it.
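
While you experiment, you can also ask Caddy what it currently thinks of each backend: the admin API exposes the upstream pool state (a quick sketch, assuming the admin endpoint and the caddy-proxy container from your script):

# Inspect the reverse_proxy upstream pool via Caddy's admin API.
# Each entry lists the upstream address plus its in-flight request and fail
# counters (the counters that fail_duration / max_fails act on).
docker compose exec caddy-proxy \
  curl -s http://localhost:2019/reverse_proxy/upstreams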


Good point.

In other words, the retry duration takes precedence over the retry count.

I have not. I understood it as: "you don’t need lb_retries if lb_try_duration is set."

I’ll experiment with that as well.

Yes, correct, at least that sounds more intuitive, or rather less “magic I don’t understand how it works” :smiley:

If I stop an upstream frankenphp server with docker stop, it sends a SIGTERM, which means frankenphp will process all in-progress requests, but immediately reject all new ones.

After that happened, it’s “safe” to remove it from the proxy config, at least in my mind.

But somehow, the opposite is required.

Hmm. I haven’t actually tinkered with it much myself. Maybe it wouldn’t hurt to have the docs be a little more explicit about it because that doesn’t seem like an uncommon conclusion to draw.

That said, looking closely, I take its implication to be the opposite: that lb_retries might be configured without lb_try_duration, but if they both are configured, the lb_try_duration acts as a hard cutoff. i.e. Caddy will try to make X retries as configured, but won’t exceed the configured duration.

I feel like good policy for cutting over would be to start routing requests to the new upstream, and when it proves healthy, stop sending requests to the old upstream, then turn the old upstream off as the very last part of the process.

There’s no way to avoid some kind of service impact if you just… turn off one upstream without telling Caddy first. Either you’ll get 502s, or with the load balancer configured to adapt, you’ll see at MINIMUM some minor impact to quality of service (delays while Caddy determines the bad upstream, rejects it, and moves to a good upstream).

By updating Caddy’s routing first, you effectively eliminate all service impact entirely.


Good point, I had actually come to the same thought after posting this and experimented with this:

When I temporarily scale up to two instances (after the new image is deemed healthy), I put the “new image” first, via the API, so the lb_policy=first makes a lot of sense here to send as few requests as possible to the old image.

reverse_proxy {$CADDY_PROXY_SERVER_DEFAULT_UPSTREAM} {
    #
    # What this means:
    # ========================================================
    #
    # dial_timeout: 500ms => Timeout for establishing a connection to the backend. We're on the same network, so this should be fast.
    # lb_policy: first => Always route to the first healthy upstream. If the first fails, fallback to other upstreams if available.
    # fail_duration + max_fails => If 3 requests fail within 5s, the upstream is considered unhealthy and caddy will not try again for another 5s.
    # lb_try_interval: 250ms => Try to find a new upstream every 250ms.
    # lb_try_duration: 15s => If we can't find an upstream in 15s, the request will fail with a 502 error.
    #
    # Why it works for our use case of blue/green deployments:
    # ========================================================
    #
    # During normal operation, we have just one upstream,
    # so the lb_policy is irrelevant.
    # If the "blue instance" is restarted for whatever reason,
    # we will get a reasonable amount of retries while docker tries to restart the container.
    # Caddy will "hold on" to the request for 15s total.
    #
    # During a zero downtime deployment,
    # we temporarily have two upstreams running at the same time (updated via API).
    # The first one (from left to right) will be the "new" code, so we want to route as many requests
    # as possible to it, instead of using a round_robin policy.
    # If the new container is unhealthy, caddy will still fallback to the old code.
    #
    transport http {
       dial_timeout 500ms
    }
    lb_policy first
    fail_duration 5s
    max_fails 3
    lb_try_interval 250ms
    lb_try_duration 15s
    # lb_retries: 10 also seems to be needed?
    # Maybe? https://caddy.community/t/confusions-about-production-ready-zero-downtime-deployment/26531/5
}

If you’re using lb_policy first you definitely want to include a fail_duration of at least a few seconds so that if the new deployment is unhealthy, Caddy can actually mark it so and enable requests to be routed to the fallback.

Ninja edit: you’ve actually got 5s there already. That looks pretty good to me!


a) Even with my adjustments, I think cutting off traffic first still makes sense, though.

It’s an edge case, but still:

  • Old instance is shut down, but still in Caddy (just not the first)
  • Then, for some reason, the new (now active) container is unhealthy again; now you have two offline upstreams. One is temporary (the new one, docker will restart it on exit), and the other is gone for good (it received a SIGTERM)

If you cut off traffic first, at least you only have one temporarily down upstream?


b)

:+1:t3: Now, the only other question is whether lb_retries is required as well.

Tbh, I don’t care how often it’s retried, I’d rather think in terms of “total duration” and “retry interval”, which then implicitly gives a number of retries (e.g. lb_try_duration 15s at lb_try_interval 250ms is at most ~60 attempts).

lb_retries seems like a redundant option to me?

Not necessarily redundant since it performs a different function; one sets a limit on time, the other sets a limit on attempts, and whichever limit is hit first stops the retries.

But, yeah, looking into it, it doesn’t seem like it should be necessary for retries to function at all.

I gave it a quick try with a very basic Caddyfile:

{
  debug
}

http:// {
  reverse_proxy localhost:8081 localhost:8082 {
    lb_policy first
    lb_try_duration 15s
    fail_duration 5s
  }
}

http://:8081 {
  abort
}

http://:8082 {
  respond "8082"
}

I found that the first request every 5 seconds had a bit of a blip but ultimately responded 8082, and then requests were very smooth for 5 seconds (as expected), then another blip. No 502 errors.

Debug logs indicated upstream roundtrips to :8081 that were EOF’d (as configured with abort) but produced no actual error-level logs, as Caddy moved smoothly back to upstream selection, dialled :8082 and got a healthy response.


That could be from the retry frequency, no?

I don’t remember, but I think the default is 250ms.

Anyhow, I’m fine with a slight delay here and there, as this is only relevant for the couple of seconds during a deployment when both instances get traffic.

It seems to be all working, tested with thousands of concurrent requests and none are dropped.
I appreciate your help!

The only remaining thing is this weird HTTP/3 header error, which still keeps popping up on every config reload.


I did find something about that here on these forums:

Does that sound like it might be relevant?

Yes, I saw that one. But that does not help much since the LB listens on the default 80/443 and a “normal” TLD.

It only happens during the config reload, though, while requests are still coming in.

Requests don’t fail, though (so error as a log level is also a bit confusing).

It is an error if you expect/require clients to use HTTP/3.

It shouldn’t impede HTTP/2, though, as you note.

I wonder why it’s happening in your case…

Yeah, who knows :smiley:

I don’t; it’s optional. But it sounds like something is wrong during the reloads, oh well.


That’s correct. lb_retries is a flat maximum number of retries, lb_try_duration is a time limit for retries. If both are configured, the time limit may cut off the flat number of retries early. You don’t need both; the duration is usually enough, and yes, it does hold the request for that long, as long as the errors are retryable (e.g. dial errors).


Thanks for confirming that.


# Set a static route for the proxy to see if it's up
respond /proxy-ping "proxy pong" 200

reverse_proxy {$CADDY_PROXY_SERVER_DEFAULT_UPSTREAM} {
                #
                # What this means:
                # ========================================================
                #
                # dial_timeout: 500ms => Timeout for establishing a connection to the backend. We're on the same network, so this should be fast.
                # lb_policy: first => Always route to the first healthy upstream. If the first fails, fallback to other upstreams if available.
                # fail_duration + max_fails => If 3 requests fail within 5s, the upstream is considered unhealthy and caddy will not try again for another 5s.
                # lb_try_interval: 250ms => Try to find a new upstream every 250ms.
                # lb_try_duration: 15s => If we can't find an upstream in 15s, the request will fail with a 502 error.
                #
                # Why it works for our use case of blue/green deployments:
                # ========================================================
                #
                # During normal operation, we have just one upstream,
                # so the lb_policy is irrelevant.
                # If the "blue instance" is restarted for whatever reason,
                # we will get a reasonable amount of retries while docker tries to restart the container.
                # Caddy will "hold on" to the request for 15s total.
                #
                # During a zero downtime deployment,
                # we temporarily have two upstreams running at the same time (updated via API).
                # The first one (from left to right) will be the "new" code, so we want to route as many requests
                # as possible to it, instead of using a round_robin policy.
                # If the new container is unhealthy, caddy will still fallback to the old code.
                #
                transport http {
                   dial_timeout 500ms
                }
                lb_policy first
                fail_duration 5s
                max_fails 3
                lb_try_interval 250ms
                lb_try_duration 15s
        }

Is this configuration unfit for high connection concurrency?

On my local machine, a load test to /proxy-ping handles 10k concurrent connections without breaking a sweat.

my.snicco.io# wrk --latency https://my.snicco.local/proxy-ping -d20s -c10000 -t8 -T30
Running 20s test @ https://my.snicco.local/proxy-ping
  8 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   176.95ms  285.70ms   2.94s    85.46%
    Req/Sec    26.29k     3.76k   66.57k    83.70%
  Latency Distribution
     50%   19.90ms
     75%  287.30ms
     90%  547.72ms
     99%    1.34s 
  3595473 requests in 20.10s, 545.20MB read
Requests/sec: 178845.73
Transfer/sec:     27.12MB

If I set the upstream to a docker image running caddy:latest serving the default “Caddy works” page, the concurrent connection count craps out completely, even at 500 concurrent connections (20x less), with many 502 responses.

(Both docker containers run in the same docker network on the same host)

my.snicco.io# wrk --latency https://my.snicco.local -d20s -c500 -t8
Running 20s test @ https://my.snicco.local
  8 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    17.75ms   17.20ms 327.10ms   88.81%
    Req/Sec     2.79k     1.71k    6.61k    64.45%
  Latency Distribution
     50%   13.03ms
     75%   23.30ms
     90%   36.45ms
     99%   77.98ms
  100595 requests in 20.07s, 1.77GB read
  Socket errors: connect 0, read 0, write 0, timeout 861
  Non-2xx or 3xx responses: 46
Requests/sec:   5012.67
Transfer/sec:     90.30MB

Is that to be expected, or is something completely wrong here?

Okay, this sent me on a wild goose chase.

It appears that the issue only pops up when proxying to a docker container inside a bridge network.

I tried every possible combination below as well,

  • caddy lb (host) => caddy (host), no issues
  • caddy lb (host) => caddy (docker with host network), no issues
  • caddy lb (docker with host network) => caddy (docker with host network), no issues

They all work fine.

But if a docker container is inside a bridge network, a lot of 502s are generated.

{
  "level": "error",
  "ts": 1733504275.8736153,
  "logger": "http.log.error",
  "msg": "dial tcp 172.23.0.2:8080: connect: cannot assign requested address",
  "request": {
    "remote_ip": "172.23.0.1",
    "remote_port": "20352",
    "client_ip": "172.23.0.1",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "localhost:9090",
    "uri": "/",
    "headers": {}
  },
  "duration": 0.172479466,
  "status": 502,
  "err_id": "1rnuiz3wm",
  "err_trace": "reverseproxy.statusError (reverseproxy.go:1269)"
}

I think the issue here is with docker networking, but it surfaces because caddy has a default of 32 for keepalive_idle_conns_per_host.

Increasing it to 1k solves the errors, but feels like a hack for sure.

{
    admin off
    auto_https off
}

:9090 {
    reverse_proxy upstream:8080 {
        transport http {
            versions h1
           #  keepalive_idle_conns_per_host 1000
        }
    }
}

What might also be the case is that I run out of (ephemeral) ports because I run wrk on the same host as the stack being benchmarked.

Is using docker bride networks a known issue?