How to eliminate downtime in a Caddy cluster?

You can use Redis Enterprise Cloud (what I’m using in my cluster setup). The multi-AZ configuration on AWS runs you $9/month (the price of a coffee in NY). 100 MB should hold plenty of certificates.
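For anyone wanting to try this: Caddy’s certificate storage backend is set at the top level of the JSON config. A rough sketch, assuming a third-party Redis storage plugin is compiled into your Caddy build — the module name and connection fields below are illustrative, so check your plugin’s README for its exact schema:

```json
{
  "storage": {
    "module": "redis",
    "host": "your-db.redislabs.com",
    "port": 16379,
    "password": "<password>"
  },
  "apps": {}
}
```

With all instances pointed at the same Redis database, each node sees the same certs and locks, which is what makes the clustered setup work without shared filesystems.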

4 Likes

Wow @zllovesuki, that’s a killer tip! Had no idea Redis Cloud offered that kind of pricing tier.

:partying_face: :tada: :purple_heart: :unicorn: :rainbow: :pray: :100: :star2: :bulb:

Just wanted to update this thread in case it is helpful to anyone running Caddy in a clustered deployment:

I completely refactored our backend code so it does this:

  1. Send a request to the Caddy API on only one server to add a route that provisions the custom domain.
    (NOTE: this request sometimes takes up to 5 minutes to return a response from Caddy’s API if there’s even light traffic on the server, so make sure your timeouts are long!)

  2. Start a loop that checks whether the certs have been downloaded to the shared folder.

  3. Once the certs are downloaded, that one Caddy instance serves the site over HTTPS as expected.

  4. Only when the certs are confirmed downloaded, tell all the other Caddy instances about the new route.

  5. After a minute or two, the other Caddy instances also serve the site over HTTPS, with no reload or restart needed, and it works great!
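For concreteness, the steps above can be sketched in Python using only the stdlib. The admin endpoint path, the shared-folder layout (certificates/&lt;issuer&gt;/&lt;domain&gt;/), and the helper names are assumptions about a setup like this, not guarantees about Caddy internals — adjust to match your config:

```python
import json
import os
import time
import urllib.request


def add_route(admin_url: str, domain: str, upstream: str) -> None:
    """Step 1: ask ONE Caddy instance to provision a route for the domain.
    The server name ("main") and route shape are illustrative."""
    route = {
        "match": [{"host": [domain]}],
        "handle": [{"handler": "reverse_proxy",
                    "upstreams": [{"dial": upstream}]}],
    }
    req = urllib.request.Request(
        f"{admin_url}/config/apps/http/servers/main/routes",
        data=json.dumps(route).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Per the thread: this call can take minutes under load; use a long timeout.
    urllib.request.urlopen(req, timeout=600)


def certs_present(cert_root: str, domain: str) -> bool:
    """Step 2: check the shared folder for the issued cert + key.
    The layout (an issuer dir containing <domain>/<domain>.crt) is an assumption."""
    if not os.path.isdir(cert_root):
        return False
    for issuer in os.listdir(cert_root):
        d = os.path.join(cert_root, issuer, domain)
        if (os.path.isfile(os.path.join(d, domain + ".crt"))
                and os.path.isfile(os.path.join(d, domain + ".key"))):
            return True
    return False


def wait_for_certs(cert_root: str, domain: str,
                   timeout: float = 300, interval: float = 5) -> bool:
    """Steps 2-3: poll until the certs land in shared storage, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if certs_present(cert_root, domain):
            return True
        time.sleep(interval)
    return False
```

Step 4 is then just calling add_route against every other instance’s admin endpoint once wait_for_certs returns True.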


1 Like

Just a quick note on the API timeout part, since many seem to be unaware of that :innocent:

You could configure a grace period. That way Caddy won’t wait virtually indefinitely to reload, as you described.

Excerpt from the Caddy JSON docs:

grace_period
Defines the grace period for shutting down HTTP servers during config reloads. If clients do not finish their requests within the grace period, the server will be forcefully terminated to allow the reload to complete and free up resources.

https://caddyserver.com/docs/json/apps/http/grace_period/
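For reference, in the JSON config this sits directly under the http app (the server definition below is just a minimal sketch):

```json
{
  "apps": {
    "http": {
      "grace_period": "10s",
      "servers": {
        "main": {
          "listen": [":443"]
        }
      }
    }
  }
}
```

With a finite grace period, a reload forcefully closes lingering connections after that duration instead of waiting on them indefinitely.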

3 Likes

@emilylange, is this working for you? I set a value of "grace_period":"3s", but then a service caddy reload --force took 17 seconds to complete.

Then I set a value of 1s, and a request to the Caddy API took 31 seconds.

What do the logs show during that?

{"level":"info","ts":1645661648.640463,"logger":"http","msg":"server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS","server_name":"main","https_port":443}
{"level":"info","ts":1645661648.6405401,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"main"}
{"level":"info","ts":1645661648.6412334,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["list.com","of.com","domains.com"]}
{"level":"info","ts":1645661676.696655,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc000a2c620"}

Hey, I thought you were supposed to be on vacation :flushed:

Huh, is the cert maintenance routine taking that long to close off? Strange… maybe context isn’t passed down properly to cancel certain things that should be cancelled. I dunno.

But there’s not a whole lot in those logs to go on; they look pretty normal overall.

Ah, yeah, if the storage backend’s locking doesn’t properly honor context cancellation, that could be a problem.

We also could probably pass context into more functions as well. (There’s an open PR for this.)

YET ANOTHER UPDATE:

Above, I said it worked great, but shortly after writing that we started experiencing numerous problems. Because of that, and the Caddy team’s advice to avoid storing certs on shared s3fs / NFS / EFS volumes, I’m giving up on that approach and going with the excellent suggestion from @zllovesuki to use Redis Enterprise. It seems to be working great; now I just need to write some code to import the contents of the existing cert files into Redis, and then we’ll migrate over.
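In case it helps anyone doing the same migration, here is a rough sketch of that import. It assumes the Redis storage plugin keys objects by their path relative to the storage root — verify that against your plugin’s actual key scheme before trusting it. `client` is anything with a `set(key, value)` method, such as a redis-py client:

```python
import os


def import_certs(storage_root: str, client) -> int:
    """Walk Caddy's file-based storage and copy every file into Redis.

    The key scheme (path relative to the storage root, with forward
    slashes) is an assumption about the storage plugin; check its README.
    Returns the number of files imported.
    """
    count = 0
    for dirpath, _dirs, files in os.walk(storage_root):
        for name in files:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, storage_root).replace(os.sep, "/")
            with open(path, "rb") as f:
                client.set(key, f.read())
            count += 1
    return count
```

Point it at the old shared folder with a real Redis client, then switch the instances’ storage config over once the counts match what you expect.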

Thanks all for all the valuable suggestions and help here!

2 Likes

This topic was automatically closed after 30 days. New replies are no longer allowed.