Introduction
There are three issues I’m trying to address here. If you think I should split them into separate posts (or move them to another category), let me know.
I’m using Kubernetes on GKE, which throws up some interesting problems with automated certificates, pod restarts, multiple Caddy front-ends managing certificates, etc.
1. Sharing Certificates
I assume that it’s best for all the servers to have the same certificates. I believe it’s not mandatory, but it would be odd (and inefficient on the client) not to do this.
2. Managing Certificates
This is easy enough: put /root/.caddy/acme on a shared disk/volume (a gcePersistentDisk on GKE). But then you have multiple Caddy servers each checking the age of the certificate and requesting a renewal when it gets close to its expiry date.
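Roughly what I mean, as a minimal sketch (the image, disk, and label names are placeholders; I’m assuming an image that keeps Caddy’s ACME data under /root/.caddy):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: caddy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: caddy
  template:
    metadata:
      labels:
        app: caddy
    spec:
      containers:
      - name: caddy
        image: abiosoft/caddy          # placeholder image
        volumeMounts:
        - name: caddy-storage
          mountPath: /root/.caddy      # Caddy's default certificate store
      volumes:
      - name: caddy-storage
        gcePersistentDisk:
          pdName: caddy-certs          # pre-created GCE disk (placeholder)
          fsType: ext4
```

(Caveat: a gcePersistentDisk can only be attached read-write to one node at a time, so replicas sharing it read-write all have to land on that node.)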
One way around this would be to have a single pod that is configured to automate certificate requests, and to configure all the others with a static entry pointing at those files (as in the documented syntax for using Caddy with your own certificate and key: `tls cert key`).
Of course, then the certificate-management pod will have to signal all the others to reload when a new certificate is installed… I haven’t worked that out yet.
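One possibility, with some big assumptions baked in: the follower pods pin the shared files with a static `tls /path/to/cert.pem /path/to/key.pem` line, and the certificate-management pod (or a Job like the sketch below) sends USR1 to the others, which Caddy treats as a graceful reload. The label selector, image, and RBAC permissions here are all made up, and I haven’t tested it:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: caddy-reload
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: reload
        image: bitnami/kubectl       # placeholder: any image with kubectl
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Send USR1 (Caddy's graceful-reload signal) to PID 1 of every
          # pod labelled app=caddy. Needs RBAC permission to exec into pods.
          for p in $(kubectl get pods -l app=caddy -o name); do
            kubectl exec "$p" -- kill -USR1 1
          done
```

This assumes Caddy runs as PID 1 in each container and that a Caddyfile reload is enough to pick up the new certificate files.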
(There are other, more complicated solutions that would involve ConfigMaps, stdin, and Secrets, but I’ve not looked into them; they don’t solve any of the issues mentioned above.)
3. Restarts near renewal time
This is a known problem (not with Caddy per se, more with orchestration, pre-emption, and bad setups), and it’s not limited to Kubernetes.
If a server keeps restarting (e.g. in a crash loop) close to renewal time, it can run into Let’s Encrypt’s rate limits. How can one guard against this?
Backing up the current certificate would certainly be beneficial (so you don’t end up with no certificate whatsoever); then, if it hasn’t expired, it could at least be put back in place. The question is: can this functionality be placed within Caddy? (Should it? And how would one mark that certificate renewal can’t happen for a week?)
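As a starting point, an initContainer could snapshot the ACME directory on the shared disk before Caddy starts, so a pod that loses or corrupts its certificate can have it put back by hand rather than triggering a fresh issuance. Again, paths and names are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: caddy
spec:
  initContainers:
  - name: backup-certs
    image: busybox
    command: ["/bin/sh", "-c"]
    args:
    - |
      # Snapshot the ACME data if it exists. A real version should be
      # smarter about not clobbering a good backup with a broken state
      # during a crash loop.
      if [ -d /root/.caddy/acme ]; then
        rm -rf /root/.caddy/acme.bak
        cp -a /root/.caddy/acme /root/.caddy/acme.bak
      fi
    volumeMounts:
    - name: caddy-storage
      mountPath: /root/.caddy
  containers:
  - name: caddy
    image: abiosoft/caddy            # placeholder image
    volumeMounts:
    - name: caddy-storage
      mountPath: /root/.caddy
  volumes:
  - name: caddy-storage
    gcePersistentDisk:
      pdName: caddy-certs            # the shared disk from above
      fsType: ext4
```

This only preserves a known-good certificate for manual recovery; it doesn’t stop a crash-looping Caddy from re-attempting issuance, which is the part that actually hits the rate limit.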
How do other people solve this? And yes, I have hit this myself: I overloaded my Kubernetes cluster, which put the pod into a crash loop.