How to eliminate downtime in a Caddy cluster?

You can use Redis Enterprise Cloud (what I’m using in my cluster setup). The multi-AZ configuration on AWS runs you $9/month (the price of a coffee in NY). 100 MB should hold plenty of certificates.
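For anyone wanting to try this: Caddy’s certificate storage backend is set at the top level of the JSON config. A rough sketch, assuming a third-party Redis storage plugin is compiled into your Caddy build — the module name and connection fields below are illustrative, so check your plugin’s README for its exact schema:

```json
{
  "storage": {
    "module": "redis",
    "host": "your-db.redislabs.com",
    "port": 16379,
    "password": "<password>"
  },
  "apps": {}
}
```

With all instances pointed at the same Redis database, each node sees the same certs and locks, which is what makes the clustered setup work without shared filesystems.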

4 Likes

Wow @zllovesuki, that’s a killer tip! Had no idea Redis Cloud offered that kind of pricing tier.

:partying_face: :tada: :purple_heart: :unicorn: :rainbow: :pray: :100: :star2: :bulb:

Just wanted to update this thread in case it is helpful to anyone running Caddy in a clustered deployment:

I completely refactored our backend code so it does this:

  1. Send a request to the Caddy API on only one server to add a route that provisions the custom domain.
    (NOTE: this request sometimes takes up to 5 minutes to return a response from Caddy’s API if there’s even light traffic on the server, so make sure your timeouts are long!)

  2. Start a loop that checks whether the certs have been downloaded to the shared folder.

  3. Once the certs are downloaded, that one Caddy instance serves the site over HTTPS as expected.

  4. Only when the certs are confirmed downloaded, tell all the other Caddy instances about the new route.

  5. After a minute or two, the other Caddy instances also serve the site over HTTPS, with no reload or restart needed, and it works great!
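For concreteness, the steps above can be sketched in Python using only the stdlib. The admin endpoint path, the shared-folder layout (certificates/&lt;issuer&gt;/&lt;domain&gt;/), and the helper names are assumptions about a setup like this, not guarantees about Caddy internals — adjust to match your config:

```python
import json
import os
import time
import urllib.request


def add_route(admin_url: str, domain: str, upstream: str) -> None:
    """Step 1: ask ONE Caddy instance to provision a route for the domain.
    The server name ("main") and route shape are illustrative."""
    route = {
        "match": [{"host": [domain]}],
        "handle": [{"handler": "reverse_proxy",
                    "upstreams": [{"dial": upstream}]}],
    }
    req = urllib.request.Request(
        f"{admin_url}/config/apps/http/servers/main/routes",
        data=json.dumps(route).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Per the thread: this call can take minutes under load; use a long timeout.
    urllib.request.urlopen(req, timeout=600)


def certs_present(cert_root: str, domain: str) -> bool:
    """Step 2: check the shared folder for the issued cert + key.
    The layout (an issuer dir containing <domain>/<domain>.crt) is an assumption."""
    if not os.path.isdir(cert_root):
        return False
    for issuer in os.listdir(cert_root):
        d = os.path.join(cert_root, issuer, domain)
        if (os.path.isfile(os.path.join(d, domain + ".crt"))
                and os.path.isfile(os.path.join(d, domain + ".key"))):
            return True
    return False


def wait_for_certs(cert_root: str, domain: str,
                   timeout: float = 300, interval: float = 5) -> bool:
    """Steps 2-3: poll until the certs land in shared storage, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if certs_present(cert_root, domain):
            return True
        time.sleep(interval)
    return False
```

Step 4 is then just calling add_route against every other instance’s admin endpoint once wait_for_certs returns True.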


1 Like

Just a quick note on the API timeout part, since many seem to be unaware of that :innocent:

You could configure a grace period. That way Caddy won’t wait virtually indefinitely to reload, as you described.

Excerpt from the Caddy JSON docs:

grace_period
Defines the grace period for shutting down HTTP servers during config reloads. If clients do not finish their requests within the grace period, the server will be forcefully terminated to allow the reload to complete and free up resources.

https://caddyserver.com/docs/json/apps/http/grace_period/
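For reference, in the JSON config this sits directly under the http app (the server definition below is just a minimal sketch):

```json
{
  "apps": {
    "http": {
      "grace_period": "10s",
      "servers": {
        "main": {
          "listen": [":443"]
        }
      }
    }
  }
}
```

With a finite grace period, a reload forcefully closes lingering connections after that duration instead of waiting on them indefinitely.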

3 Likes

@emilylange, is this working for you? I set a value of "grace_period":"3s", but then a service caddy reload --force took 17 seconds to complete.

Then I set a value of 1s, and a request to the Caddy API took 31 seconds.

What do the logs show during that?

{"level":"info","ts":1645661648.640463,"logger":"http","msg":"server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS","server_name":"main","https_port":443}
{"level":"info","ts":1645661648.6405401,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"main"}
{"level":"info","ts":1645661648.6412334,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["list.com","of.com","domains.com"]}
{"level":"info","ts":1645661676.696655,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc000a2c620"}

Hey, I thought you were supposed to be on vacation :flushed:

Huh, is the cert maintenance routine taking that long to close off? Strange… maybe context isn’t passed down properly to cancel certain things that should be cancelled. I dunno.

But there’s not a whole lot in those logs to go on; they look pretty normal overall.

Ah, yeah, if the storage backend’s locking doesn’t properly honor context cancellation, that could be a problem.

We also could probably pass context into more functions as well. (There’s an open PR for this.)

YET ANOTHER UPDATE:

Above, I said it worked great, but shortly after writing that we started experiencing numerous problems. Because of that, and the Caddy team’s advice to avoid storing certs on shared s3fs / NFS / EFS volumes, I’m giving up on that approach and going with the excellent suggestion from @zllovesuki to use Redis Enterprise. It seems to be working great; now I just need to write some code to import the contents of the existing cert files into Redis, and then we’ll migrate over.
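In case it helps anyone doing the same migration, here is a rough sketch of that import. It assumes the Redis storage plugin keys objects by their path relative to the storage root — verify that against your plugin’s actual key scheme before trusting it. `client` is anything with a `set(key, value)` method, such as a redis-py client:

```python
import os


def import_certs(storage_root: str, client) -> int:
    """Walk Caddy's file-based storage and copy every file into Redis.

    The key scheme (path relative to the storage root, with forward
    slashes) is an assumption about the storage plugin; check its README.
    Returns the number of files imported.
    """
    count = 0
    for dirpath, _dirs, files in os.walk(storage_root):
        for name in files:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, storage_root).replace(os.sep, "/")
            with open(path, "rb") as f:
                client.set(key, f.read())
            count += 1
    return count
```

Point it at the old shared folder with a real Redis client, then switch the instances’ storage config over once the counts match what you expect.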

Thanks all for all the valuable suggestions and help here!

2 Likes

This topic was automatically closed after 30 days. New replies are no longer allowed.