TLS certs inaccessible and/or being reissued

I have a multi-tenant SaaS with ~400 domains that has been working amazingly well for years. Caddy data is stored in a managed Redis DB, as was recommended to me in this forum previously.

1. The problem I’m having:

Recently, it seems that the certificates need to be reissued every time my Caddy container is rebuilt/restarted, as if the certs are stored locally in the Docker container rather than in an offsite persistent location.

I see the caddy-tlsredis Caddy extension I’m using is now deprecated, and while I only noticed this issue today, I’m not actually sure when the problem started. Hosting only a few hundred low-traffic domains means that it’s possible Caddy has been quickly and quietly reissuing all ~400 certificates every time my Caddy container has been restarted, without hitting rate limits.

Today, though, I rebuilt the Caddy server a few times in a row, likely stacking up my CA requests, hitting the rate limits, and making this noticeable.

  1. I know it’s a third-party extension, but does anyone know if caddy-tlsredis stopped working?
  2. Would upgrading to caddy-storage-redis fix the issue?
  3. Did I never actually have persistent storage set up correctly in the first place?

2. Error messages and/or full log output:

Here’s a log entry when a domain is inaccessible, even though the cert was previously issued and saved to Redis. I’m unsure which rate limiter I’m hitting. There’s no mention of LE or ZeroSSL, so maybe it’s the internal limiter?

2024/04/06 03:04:09.005	DEBUG	events	event	{"name": "tls_get_certificate", "id": "41a4302a-ba3c-4eb6-8653-ac19a33e9546", "origin": "tls", "data": {"client_hello":{"CipherSuites":[49195,49199,49196,49200,52393,52392,49161,49171,49162,49172,156,157,47,53,49170,10,4865,4866,4867],"ServerName":"www.moonrovrland.com","SupportedCurves":[29,23,24,25],"SupportedPoints":"AA==","SignatureSchemes":[2052,1027,2055,2053,2054,1025,1281,1537,1283,1539,513,515],"SupportedProtos":null,"SupportedVersions":[772,771],"Conn":{}}}}
2024/04/06 03:04:09.005	DEBUG	tls.handshake	no matching certificates and no custom selection logic	{"identifier": "www.moonrovrland.com"}
2024/04/06 03:04:09.005	DEBUG	tls.handshake	no matching certificates and no custom selection logic	{"identifier": "*.moonrovrland.com"}
2024/04/06 03:04:09.005	DEBUG	tls.handshake	no matching certificates and no custom selection logic	{"identifier": "*.*.com"}
2024/04/06 03:04:09.005	DEBUG	tls.handshake	no matching certificates and no custom selection logic	{"identifier": "*.*.*"}
2024/04/06 03:04:09.011	DEBUG	tls	response from ask endpoint	{"domain": "www.moonrovrland.com", "url": "http://dashboard:3000/api/domain.check?domain=www.moonrovrland.com", "status": 200}
2024/04/06 03:04:09.011	DEBUG	http.stdlib	http: TLS handshake error from 10.124.0.6:12950: certificate is not allowed for server name www.moonrovrland.com: decision func: on-demand rate limit exceeded

And here’s the log when the same domain is requested some time later, presumably once the rate limiter has caught up:

2024/04/06 03:06:41.775	DEBUG	events	event	{"name": "tls_get_certificate", "id": "bfc43e86-c1df-40f5-a753-3ace00552269", "origin": "tls", "data": {"client_hello":{"CipherSuites":[35466,4865,4866,4867,49195,49199,49196,49200,52393,52392,49171,49172,156,157,47,53],"ServerName":"www.moonrovrland.com","SupportedCurves":[31354,29,23,24],"SupportedPoints":"AA==","SignatureSchemes":[1027,2052,1025,1283,2053,1281,2054,1537],"SupportedProtos":["h2","http/1.1"],"SupportedVersions":[23130,772,771],"Conn":{}}}}
2024/04/06 03:06:41.776	DEBUG	tls.handshake	no matching certificates and no custom selection logic	{"identifier": "www.moonrovrland.com"}
2024/04/06 03:06:41.776	DEBUG	tls.handshake	no matching certificates and no custom selection logic	{"identifier": "*.moonrovrland.com"}
2024/04/06 03:06:41.776	DEBUG	tls.handshake	no matching certificates and no custom selection logic	{"identifier": "*.*.com"}
2024/04/06 03:06:41.776	DEBUG	tls.handshake	no matching certificates and no custom selection logic	{"identifier": "*.*.*"}
2024/04/06 03:06:41.782	DEBUG	tls	response from ask endpoint	{"domain": "www.moonrovrland.com", "url": "http://dashboard:3000/api/domain.check?domain=www.moonrovrland.com", "status": 200}
2024/04/06 03:06:41.782	DEBUG	tls.handshake	all external certificate managers yielded no certificates and no errors	{"remote_ip": "10.124.0.6", "remote_port": "18436", "sni": "www.moonrovrland.com"}
2024/04/06 03:06:41.786	DEBUG	tls	loading managed certificate	{"domain": "www.moonrovrland.com", "expiration": "2024/05/20 04:00:07.000", "issuer_key": "acme-v02.api.letsencrypt.org-directory", "storage": "{\"address\":\"REDACTED\",\"host\":\"REDACTED\",\"port\":\"REDACTED\",\"db\":0,\"username\":\"default\",\"password\":\"REDACTED\",\"timeout\":5,\"key_prefix\":\"caddytls\",\"value_prefix\":\"caddy-storage-redis\",\"aes_key\":\"\",\"tls_enabled\":true,\"tls_insecure\":true}"}
2024/04/06 03:06:41.914	DEBUG	tls.cache	added certificate to cache	{"subjects": ["www.moonrovrland.com"], "expiration": "2024/05/20 04:00:07.000", "managed": true, "issuer_key": "acme-v02.api.letsencrypt.org-directory", "hash": "7a8eb46e1ba4f8f5924ae929f18dc7b45f785f1412e20bc4421fde94230afd21", "cache_size": 532, "cache_capacity": 10000}
2024/04/06 03:06:41.914	DEBUG	events	event	{"name": "cached_managed_cert", "id": "8d005111-9aa5-4adb-9a80-0e2de4262688", "origin": "tls", "data": {"sans":["www.moonrovrland.com"]}}
2024/04/06 03:06:41.914	DEBUG	tls.handshake	loaded certificate from storage	{"remote_ip": "10.124.0.6", "remote_port": "18436", "subjects": ["www.moonrovrland.com"], "managed": true, "expiration": "2024/05/20 04:00:07.000", "hash": "7a8eb46e1ba4f8f5924ae929f18dc7b45f785f1412e20bc4421fde94230afd21"}
2024/04/06 03:06:41.935	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "website:3000", "total_upstreams": 1}
2024/04/06 03:06:42.050	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "website:3000", "duration": 0.114481533, "request": {"remote_ip": "10.124.0.6", "remote_port": "18436", "client_ip": "10.124.0.6", "proto": "HTTP/2.0", "method": "GET", "host": "www.moonrovrland.com", "uri": "/", "headers": {"Sec-Fetch-Dest": ["document"], "Sec-Ch-Ua": ["\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\""], "X-Forwarded-Host": ["www.moonrovrland.com"], "Sec-Ch-Ua-Platform": ["\"macOS\""], "Accept-Language": ["en-US,en;q=0.9"], "Accept": ["text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"], "Accept-Encoding": ["gzip, deflate, br, zstd"], "Sec-Fetch-User": ["?1"], "Sec-Fetch-Mode": ["navigate"], "Sec-Ch-Ua-Mobile": ["?0"], "Upgrade-Insecure-Requests": ["1"], "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"], "Sec-Fetch-Site": ["none"], "X-Forwarded-For": ["10.124.0.6"], "X-Forwarded-Proto": ["https"], "Cache-Control": ["max-age=0"]}, "tls": {"resumed": false, "version": 772, "cipher_suite": 4865, "proto": "h2", "server_name": "www.moonrovrland.com"}}, "headers": {"Content-Encoding": ["gzip"], "Connection": ["keep-alive"], "Keep-Alive": ["timeout=5"], "X-Ratelimit-Remaining": ["946"], "Etag": ["W/\"15cc8-BnquPNU4T/HRaoyoyBACt/afwVc\""], "Vary": ["Accept-Encoding"], "Content-Type": ["text/html; charset=utf-8"], "X-Ratelimit-Limit": ["1000"], "Date": ["Sat, 06 Apr 2024 03:06:41 GMT"], "X-Ratelimit-Reset": ["1712372843"]}, "status": 200}

While I’m not very familiar with Redis, Caddy is the only thing using Redis and the DB is full of keys that reference the domains, so I assume things are getting saved there.

3. Caddy version:

v2.7.4

I would test with 2.7.6, but doing so would require a container restart and risks breaking all the hosted websites again. Hoping to get some insight before having to do that.

4. How I installed and ran Caddy:

It’s running in a dedicated Docker container, as a reverse proxy, and managed by Docker Compose (docker-compose.yml included in a later answer). I’m using the aforementioned Redis extension + Cloudflare for wildcard subdomain cert management.

# Dockerfile

# Start with Caddy Builder
FROM caddy:2.7.4-builder-alpine AS builder

# Setup CloudFlare and TLS Redis Plugins
RUN xcaddy build \
	--with github.com/caddy-dns/cloudflare@a9d3ae2690a1d232bc9f8fc8b15bd4e0a6960eec \
	--with github.com/gamalan/caddy-tlsredis@master

# Final image: replace the stock binary with the custom build
FROM caddy:2.7.4-alpine
COPY --from=builder /usr/bin/caddy /usr/bin/caddy

a. System environment:

The container OS/version/etc should be explained by the Docker config files in the adjacent answers, but the host machine’s specs are:

  • Ubuntu: 20.04.2
  • Docker: 20.10.7
  • Docker Compose: 1.27.4

b. Command:

Outside the Docker configs already mentioned, the commands should be taken care of by the Docker image.

c. Service/unit/compose file:

# docker-compose.yml

version: '3.5'

services:

    server:
        build: ./server
        restart: unless-stopped
        container_name: server
        env_file:
            - ./.env
        ports:
            - 80:80
            - 443:443
        volumes:
            - ./server/caddy-${ENV}:/etc/caddy # Config

    website: REDACTED

    dashboard: REDACTED

d. My complete Caddy config:

# Caddyfile
{
    email REDACTED
    admin 0.0.0.0:2019
    on_demand_tls {
        ask http://dashboard:3000/api/domain.check
        interval 2m
        burst 5
    }
    storage redis {
        host {$REDIS_HOST}
        port {$REDIS_PORT}
        username {$REDIS_USERNAME}
        password {$REDIS_PASSWORD}
        db {$REDIS_DB}
        tls_enabled {$REDIS_TLS}
    }
    log {
        output file /var/log/caddy/access.log {
            roll_size 100MiB
            roll_uncompressed
            roll_keep 5
            roll_keep_for 48h
        }
        format console
        level DEBUG
    }
}

(headers) {
    header -x-powered-by
}

# Subdomains
*.REDACTED.com {
    tls {
        dns cloudflare {$CLOUDFLARE_TOKEN}
    }
    import headers
    reverse_proxy website:3000
}

# Custom Domains
https:// {
    tls {
        on_demand
    }
    import headers
    reverse_proxy website:3000
}

Please use the latest version, v2.7.6.

The interval and burst options in your on_demand_tls block configure the on-demand rate limiter. As set, it allows only 5 certificates per 2 minutes, which is far too few. Just remove them; there’s no need for them.

We recently changed the order of the storage and on-demand checks, so the on-demand check now runs before Caddy tries to load a certificate from storage.

So if you have aggressive rate limiting on on-demand, it makes it impossible for certs to be loaded from storage once the limit is hit.

These options are deprecated anyway (you should be seeing warning logs at startup).
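
For illustration, here is a minimal sketch of what the on_demand_tls block from the posted Caddyfile might look like with the two rate-limit options removed. It assumes the ask endpoint stays exactly as posted and that the rest of the global options block (email, admin, storage, log) is left unchanged; only the relevant part is shown.

```caddyfile
{
    # On-demand TLS with no interval/burst rate limiting.
    # The ask endpoint still decides which domains may get a certificate.
    on_demand_tls {
        ask http://dashboard:3000/api/domain.check
    }
}
```

With the limiter gone, the ask endpoint is the only gate on on-demand issuance, so it should keep returning a non-200 status for any domain that is not a known customer domain.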


That seems to have fixed it. Thank you so much!
