For my stack I have Docker Swarm configured to start Caddy as a service. So every time I change the Caddyfile, this is what happens:
- CI pipeline produces a new Docker image with the changed Caddyfile.
- CI starts a service update procedure on the swarm.
- One of the 3 running Caddy containers is stopped and removed from load balancing.
- A new caddy container is started with the new image. This new container is then watched with a healthcheck for a configurable amount of time. If it does not respond within this deadline, the update process is considered a failure (this image must have a bug) and everything is rolled back.
- When the first new container starts listening, a second old one is stopped and a new one is created, which must also pass the healthcheck. This repeats until all Caddy containers are using the new image.
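For reference, the rolling-update behavior above roughly corresponds to a compose file like this (the image name, replica count, and healthcheck URL are placeholders, not my actual values):

```yaml
version: "3.7"
services:
  caddy:
    image: my-registry/caddy:latest   # built by CI with the Caddyfile baked in
    deploy:
      replicas: 3
      update_config:
        parallelism: 1            # replace one container at a time
        monitor: 30s              # watch each new task for this long
        failure_action: rollback  # roll everything back if it never gets healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:2015/"]
      interval: 10s
      timeout: 5s
      retries: 3
```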
This works great for zero-downtime deployments, and allows me to update Caddy, change plugins, etc., without issues. The problem is obtaining TLS certificates:
If I add too many new domains in a single update (say… 5 domains), obtaining the certificates takes long enough that the update process may fail (note: previously obtained certificates are preserved between updates). This means I have to set a long deadline (1 minute?) before the healthcheck is allowed to fail. With such a long deadline, the update process becomes slow (5 minutes?) to complete, and if some day I need to add 10 domains in one go, it will still fail. Not good.
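Concretely, the knob I keep having to stretch is the update monitor window; with the Docker CLI the tradeoff looks like this (values illustrative):

```shell
# Give each replaced task up to 60s to become healthy before the
# update is declared failed and rolled back. Long enough for cert
# issuance = painfully slow updates; short enough for fast updates
# = spurious rollbacks when new domains are added.
docker service update \
  --update-monitor 60s \
  --update-failure-action rollback \
  --image my-registry/caddy:new-tag \
  caddy
```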
My first attempt at a solution was to use On-Demand TLS for all hosts, which means certificates are obtained lazily. This is ok-ish, since I can visit each site manually myself (actually, the uptime monitoring tool visits the sites frequently, forcing them to always have a ready certificate). The problem is that when I do that, Caddy seems not to take into account that most of the certs were already obtained and are sitting in the .caddy directory. It does not load certs from there, and obtains a new cert even for a domain we have had for quite a while. Not good.
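For clarity, this is roughly what I mean by On-Demand TLS in the Caddyfile (Caddy v1 syntax; `max_certs` enables on-demand issuance, the site address and limit here are just examples):

```
example.com {
    # On-demand TLS: defer issuance until the first TLS handshake
    # for the hostname, capped at max_certs new certificates.
    tls {
        max_certs 10
    }
    proxy / backend:8080
}
```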
The dream solution would be for the “Activating privacy features…” phase to be non-blocking. If a vhost can start serving, let it serve as soon as possible. Of course, a vhost without a cert would have to wait (or maybe be served with an incorrect cert), but that’s OK. Is this possible in the current design?
Thanks!