Caddy behind a TCP load balancer - making sure SSL provisioning works on startup

Hello! I had the same problem as someone else trying to use Caddy behind a (Hetzner) load balancer, and wanted to share my solution.

The problem is that TCP load balancers can take a few seconds to spot that a service has been (re)started, and won’t forward traffic straight away - in my case it seemed to take 30-60s.

So if you start Caddy behind such a load balancer, with the LB forwarding ports 80 & 443, Caddy immediately starts trying to provision a certificate and fails, because the Let’s Encrypt servers can’t reach it back through the LB to validate the challenge. If you’re unlucky, Caddy fails 3 times, trips Let’s Encrypt’s rate limit, and you can’t try again for an hour.

As part of my app deployment, I was already creating a Caddy package based off library/caddy, so I made a couple of changes to make the startup reliable.

Firstly, I put this script into the new package, to stage the startup:

#!/bin/sh
if [ -n "$CADDY_TEST_URL" ] ; then
  echo "Listening on port 80 and testing $CADDY_TEST_URL"
  ( printf "HTTP/1.1 200 OK\n\nPort 80 is reachable" | nc -l -p 80 -q 0 ) &
  while ! curl -s -o /dev/null "$CADDY_TEST_URL" ; do
    sleep 1
  done
  # race condition here: nc exits after serving one request, so port 80
  # is briefly unbound until caddy takes over below
fi
echo "Starting caddy"
exec caddy run --config /etc/caddy/Caddyfile --adapter caddyfile

Then my Dockerfile has this as a separate build step:

FROM library/caddy AS caddy

# the script's path was mangled in this post; /start.sh below is a stand-in
COPY start.sh /start.sh
RUN chmod +x /start.sh
RUN /sbin/apk add netcat-openbsd curl

COPY Caddyfile /etc/caddy/Caddyfile
# this is just the static content from my app
COPY --from=app /srv/app/public /usr/share/caddy/
CMD ["/start.sh"]

So when I start caddy as part of my stack, the script listens on port 80 and probes the URL specified in $CADDY_TEST_URL - which is the external load balancer URL. This fails a few times until the external load balancer starts routing traffic. Then it starts Caddy, which can now provision certificates safely.
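In case it helps anyone, the wiring in Compose would look something like this (the service layout and URL are placeholders, not from my actual stack):

services:
  caddy:
    build: .
    environment:
      # external load balancer URL to probe before starting Caddy
      CADDY_TEST_URL: "https://example.com/"
    ports:
      - "80:80"
      - "443:443"

If CADDY_TEST_URL is unset, the script skips the probe and just execs Caddy directly, so the same image works without a load balancer in front.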

This has a theoretical race in it, but one that’s very unlikely to be tripped given the load balancer behaviour. Of course it’s a grubby shim, but I’m well into a Docker workflow so it barely registers :grinning:

Is there a better way of doing this? I could only see DNS auth, but I don’t use a supported DNS provider.

I don’t know whether Caddy could help out a bit more - e.g. it’d be lovely if it could self-test connectivity before reaching out to Let’s Encrypt. But even some control over timing & retries would be enough to make it reliable.

I couldn’t really follow the v2 docs in the way that I could for v1, so I wasn’t sure if I might be missing something - so any improvements / comments would be appreciated.

Caddy v2 uses safer logic for Let’s Encrypt than v1: if the first attempt to issue a cert fails, it then attempts issuance against LE’s staging environment until it gets a good response, and only then tries the production environment again to get a real cert.
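(As an aside, if you want to test a setup like this without risking production rate limits at all, you can point Caddy at the staging CA yourself with the acme_ca global option at the top of your Caddyfile, then remove it once things work:)

{
  acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}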

I’m thinking your shim might not be necessary now with Caddy v2, compared to Caddy v1 (if I’m understanding correctly, you’re just now upgrading from v1 to v2?)

This behaviour is described in this section of the docs:

Also, I moved your post to a different category, the Wiki category is meant for evergreen guides for using Caddy and such.

Yes! Thanks for the fast reply. I did just upgrade from v1, as it had been disappeared :slight_smile: and this was the first time I’d had to work on it in a few months.

But I didn’t spot the logic had changed, and it sounds much smarter, thanks for the pointer. I will gingerly remove this hack.

