1. The problem I’m having:
I’m using Caddy as a load balancer in front of FrankenPHP (Caddy + Laravel Octane) on a single machine to support zero-downtime deployments via Docker Compose.
There are some unclear things related to Caddy config reloads that only seem to work "by accident".
After trying many different approaches, I found a blue-green deployment strategy that appears to work (tested with many thousands of concurrent requests during deployment).
TL;DR (full script at the bottom of the post)
1. Bring up a new instance (green) with the new image.
2. Once healthy, use the admin API to update the reverse_proxy upstreams to both blue (current image) and green (a sketch of this call follows the list).
3. Send SIGTERM to blue, which causes the upstream caddy/frankenphp to gracefully exit and reject new requests. Traffic now only goes to green via lb_policy round_robin.
4. Bring up blue with the new image. Traffic now goes to two identical upstreams.
5. Use the admin API to update the reverse_proxy upstreams to only the new blue instance.
6. Send SIGTERM to green (same as step 3).
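For reference, the upstream swap in steps 2 and 5 is a single PATCH against Caddy's admin API. A minimal sketch (the exact /srv0/routes/... path depends on how the Caddyfile in section 4d adapts to JSON; it matches the endpoint used in the full script below):

# Sketch: replace the active upstream list (here: green + blue, i.e. step 2).
curl -s --fail-with-body -X PATCH \
  -H "Content-Type: application/json" \
  -d '[{"dial":"app-server-green:80"},{"dial":"app-server-blue:80"}]' \
  http://localhost:2019/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/handle/0/upstreams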
This works, but I don’t understand why steps 5 and 6 must happen in that order.
I banged my head against the wall with the more intuitive approach of first gracefully shutting down the green instance, and only then removing it from the reverse_proxy config.
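Concretely, the failing order was roughly this (sketch; service names as in the compose file and script below):

# 1. Gracefully stop green first (SIGTERM, request draining)...
docker compose stop --timeout 60 app-server-green
# 2. ...and only afterwards remove it from the upstream list (route to blue only).
curl -s -X PATCH -H "Content-Type: application/json" \
  -d '[{"dial":"app-server-blue:80"}]' \
  http://localhost:2019/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/handle/0/upstreams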
Problem a) That leads to a lot of dropped requests:
Status code distribution:
[200] 280614 responses
[502] 113 responses
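(The numbers above come from hey running against the proxy for the entire duration of a deployment; the invocation was along these lines, with the concurrency/duration values here being placeholders rather than the exact ones used:)

hey -z 120s -c 1000 https://my.snicco.local/test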
An example error from the reverse proxy when steps 5 and 6 are swapped (graceful shutdown first, then upstream update) is included in section 2 below. /test is the endpoint of the load test; the "bytes_read": 0 field is interesting.
On the other hand, first removing the upstream from the load balancer, and then gracefully shutting down the container works flawlessly every time.
Problem b) Under high load, during config reloads, Caddy prints a lot of odd error messages that I can’t decipher.
2. Error messages and/or full log output:
Errors for problem a)
{
  "level": "error",
  "ts": 1732650117.984192,
  "logger": "http.log.access.log0",
  "msg": "handled request",
  "request": {
    "remote_ip": "172.21.0.1",
    "remote_port": "60938",
    "client_ip": "172.21.0.1",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "my.snicco.local",
    "uri": "/test",
    "headers": {
      "User-Agent": [
        "hey/0.0.1"
      ],
      "Content-Type": [
        "text/html"
      ],
      "Accept-Encoding": [
        "gzip"
      ]
    },
    "tls": {
      "resumed": false,
      "version": 772,
      "cipher_suite": 4865,
      "proto": "",
      "server_name": "my.snicco.local"
    }
  },
  "bytes_read": 0,
  "user_id": "",
  "duration": 0.004045749,
  "size": 65,
  "status": 502,
  "resp_headers": {
    "Server": [
      "Caddy"
    ],
    "Content-Type": [
      "text/plain; charset=utf-8"
    ]
  }
}
Errors for problem b) (weird errors)
{
  "level": "error",
  "ts": 1732651645.6192622,
  "logger": "http.log",
  "msg": "setting HTTP/3 Alt-Svc header",
  "error": "no port can be announced, specify it explicitly using Server.Port or Server.Addr"
}
msg=HTTP/2 skipped because it requires TLS
3. Caddy version:
/srv # caddy version
v2.8.4 h1:q3pe0wpBj1OcHFZ3n/1nl4V4bxBrYoSoab7rL9BMYNk=
I also tried with the latest beta (same issue).
/srv # caddy version
v2.9.0-beta.3 h1:tlqfbJMRNY6vnWwaQrnWrgS+wkDXr9GIFUD/P+HY9vA=
4. How I installed and ran Caddy:
a. System environment:
Linux alkan-122334539-dev
6.8.0-49-generic #49~22.04.1-Ubuntu SMP
PREEMPT_DYNAMIC
Wed Nov 6 17:42:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Docker version 27.3.1, build ce12230
Docker Compose version v2.29.7
b. Command:
Default Caddy entrypoint of the official Docker image.
FROM caddy:2.8.4-alpine
RUN apk --no-cache add curl
# Create a caddy user/group
RUN <<SHELL
set -euo pipefail
addgroup -S caddy && adduser -S -G caddy caddy &&
mkdir -p /data/caddy
mkdir -p /config/caddy
chown -R caddy:caddy /data/caddy
chown -R caddy:caddy /config/caddy
SHELL
COPY caddy-proxy.Caddyfile /etc/caddy/Caddyfile
USER caddy
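Since the deployment script drives the admin API with curl from inside this container, a quick sanity check after building is (sketch):

# The admin API listens on localhost:2019 inside the caddy-proxy container.
docker compose exec caddy-proxy curl -s http://localhost:2019/config/ | head -c 300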
c. Service/unit/compose file:
# Relevant parts from docker-compose.yaml
services:
  caddy-proxy:
    container_name: caddy-proxy
    image: ghcr.io/snicco/my_snicco/caddy-proxy:${GIT_SHA:-latest}
    restart: always
    ports:
      - "80:80" # HTTP => redirect to HTTPS
      - "443:443" # HTTPS
      - "443:443/udp" # HTTP/3
    volumes:
      - caddy_proxy_data:/data/caddy
      - caddy_proxy_config:/config/caddy
    build:
      context: ../../infrastructure/docker/caddy-proxy
      dockerfile: caddy-proxy.dockerfile
    environment:
      SERVER_NAME: https://${LARAVEL_APP_HOST:?}
      CADDY_PROXY_GLOBAL_OPTIONS: ${CADDY_PROXY_GLOBAL_OPTIONS:-}
      CADDY_PROXY_GLOBAL_LOG_LEVEL: ${CADDY_PROXY_GLOBAL_LOG_LEVEL:-warn}
      CADDY_PROXY_SERVER_LOG_LEVEL: ${CADDY_PROXY_SERVER_LOG_LEVEL:-warn}
  app-server-blue: &app-server
    # a custom dockerfile based on frankenphp (laravel-octane+caddy)
  app-server-green:
    <<: *app-server
d. My complete Caddy config:
{
    {$CADDY_PROXY_GLOBAL_OPTIONS}
    # No reason to add certificates in local dev.
    # Caddy only runs in docker.
    skip_install_trust
    # Set the global log level for all logs
    log {
        level {$CADDY_PROXY_GLOBAL_LOG_LEVEL}
        output stderr
    }
}
{$SERVER_NAME} {
    reverse_proxy app-server-blue:80 {
        # Allow Caddy to "queue" a request for up to 15 seconds before giving up,
        # retrying an upstream every 250ms.
        lb_try_duration 15s
        lb_try_interval 250ms
        # Rotate requests across all configured upstreams.
        lb_policy round_robin
    }
    handle_errors {
        @timeout {
            expression {http.error.status_code} == 502
        }
        respond @timeout "The service is unavailable try again later - upstream unavailable" 502
    }
    log {
        # Configurable log level (e.g., DEBUG, INFO, WARN, ERROR)
        level {$CADDY_PROXY_SERVER_LOG_LEVEL}
        # access logs to stdout, server logs to stderr.
        output stdout
    }
}
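The long /srv0/routes/... path that the deployment script PATCHes comes from how this Caddyfile adapts to JSON. To double-check the indices (sketch):

# Show the adapted JSON config to confirm the route/handler indices...
docker compose exec caddy-proxy caddy adapt --config /etc/caddy/Caddyfile --pretty
# ...or read the live upstream list straight from the admin API.
docker compose exec caddy-proxy curl -s \
  http://localhost:2019/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/handle/0/upstreams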
5. Links to relevant resources:
My deployment script:
#!/usr/bin/env bash
set -euo pipefail
CADDY="caddy-proxy"
BLUE="app-server-blue"
GREEN="app-server-green"
LARAVEL_HEALTHCHECK_ROUTE="health"
CADDY_ADMIN_ENDPOINT="http://localhost:2019"
CADDY_PATCH_UPSTREAM_ENDPOINT="$CADDY_ADMIN_ENDPOINT/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/handle/0/upstreams"
START_UTC=$(date +%s)
heading() {
    local YELLOW
    local NC
    YELLOW='\033[1;33m'
    NC='\033[0m'
    {
        echo
        echo -e "$YELLOW================================================================================$NC"
        echo -e "$YELLOW $1 $NC"
        echo -e "$YELLOW================================================================================$NC"
        echo
    }
}
prepare() {
    # Start services on first deployment, if not already running.
    bash dc.sh pull
    bash dc.sh up "$CADDY" "$BLUE" --detach --no-recreate
}
acquire_lock() {
    LOCKFILE="/tmp/zero-downtime-deploy.lock"
    exec 200>"$LOCKFILE"
    flock -n 200 || {
        echo "ERROR: Another instance of the deployment script is running."
        exit 1
    }
    echo "Acquired lock on: $LOCKFILE"
    trap 'rm -f "$LOCKFILE"' EXIT
    echo "Set trap to remove lock file on exit."
}
route_new_caddy_lb_requests_to() {
    local targets=$1
    local upstreams=""
    case $targets in
        "$BLUE")
            upstreams="[{\"dial\":\"${BLUE}:80\"}]"
            ;;
        "$GREEN")
            upstreams="[{\"dial\":\"${GREEN}:80\"}]"
            ;;
        "both")
            upstreams="[{\"dial\":\"${GREEN}:80\"},{\"dial\":\"${BLUE}:80\"}]"
            ;;
        *)
            echo "Invalid target: $targets"
            exit 1
            ;;
    esac
    echo "Routing traffic to $targets..."
    bash dc.sh exec "$CADDY" curl -s --fail-with-body -H "Content-Type: application/json" -d "$upstreams" -X PATCH "$CADDY_PATCH_UPSTREAM_ENDPOINT"
    echo "Traffic routed successfully."
}
wait_until_frankenphp_healthy() {
    local container=$1
    local i
    for i in {1..30}; do
        echo "Waiting for $container to become healthy... [$i/30]"
        # Note: Don't put the retry logic inside curl, because docker exec could also fail,
        # and would not be retried.
        # Using curl from within the caddy container is a good way to test that docker DNS is working
        # as expected.
        if bash dc.sh exec "$CADDY" curl -s --fail-with-body "http://$container:80/$LARAVEL_HEALTHCHECK_ROUTE"; then
            return 0
        fi
        sleep 1
    done
    echo "ERROR: $container did not become healthy after 30 tries."
    exit 1
}
start_frankenphp() {
    local container=$1
    bash dc.sh up \
        --detach \
        --no-deps \
        "$container"
}
stop_frankenphp_gracefully() {
    local container_name=$1
    # Octane has a timeout of 30 seconds.
    # This sends a SIGTERM, and after 60 seconds a SIGKILL.
    bash dc.sh stop "$container_name" --timeout=60
    bash dc.sh rm "$container_name" --force
}
heading "Acquiring deployment lock."
acquire_lock
# Step 1: Bring up initial state.
# Caddy is always expected to start
# with app-server-blue, and route traffic to it.
heading "Restoring/Creating initial services."
prepare
# Just in case there was a partial failure somewhere, and green is actually running,
# wait until blue is healthy before switching to blue only.
wait_until_frankenphp_healthy "$BLUE"
route_new_caddy_lb_requests_to "$BLUE"
# If there is still a green instance running from a previous deployment for whatever reason,
# stop it gracefully.
stop_frankenphp_gracefully "$GREEN"
# Step 2: Bring up green instance with the new image.
# The container does not yet receive requests.
heading "Bringing up green instance with the new image."
start_frankenphp "$GREEN"
# Once healthy, it's safe to route traffic to the green instance
# in addition to the blue instance.
wait_until_frankenphp_healthy "$GREEN"
# Now, both green (new code) and blue (old code) are running
# and will get requests from the caddy load balancer.
route_new_caddy_lb_requests_to "both"
# Step 3: Bring down blue instance (old code)
# Sending a SIGTERM to frankenphp will perform "request draining".
# Frankenphp will process all existing requests that have arrived,
# but will reject any new requests.
# The caddy LB will re-route all rejected requests to the
# green instance with the new code.
heading "Gracefully stopping blue instance (old code)"
stop_frankenphp_gracefully "$BLUE"
# Step 4: Restart blue instance with the new code.
heading "Restarting blue instance with the new code."
start_frankenphp "$BLUE"
wait_until_frankenphp_healthy "$BLUE"
# Step 5: Restore original routing configuration.
# Right now, we have two frankenphp instances running,
# both using 2X the number of CPUs (4x total).
# Stop giving traffic to the green instance.
heading "Re-routing traffic to blue instance."
route_new_caddy_lb_requests_to "$BLUE"
# Step 6: Bring down green instance to free up resources.
# Again, sending a SIGTERM to frankenphp will perform "request draining".
# But it's VERY IMPORTANT that we shut down
# the green instance AFTER the traffic switch to blue.
# Previously, we tried to shut it down before the traffic switch,
# which, in theory, should also work, but in practice, it did not
# and caused some requests to be dropped under high load with a 502 error.
# My best guess is that Caddy runs "two internal servers" during the traffic switch,
# and the old one can't send requests to the green instance because
# it does not accept connections.
# But also, Caddy won't bounce them to the blue instance either (reason unknown).
heading "Gracefully stopping green instance."
stop_frankenphp_gracefully "$GREEN"
END_UTC=$(date +%s)
heading "[OK] Zero downtime deployment complete. Took $((END_UTC - START_UTC)) seconds."