How to eliminate downtime in a Caddy cluster?

1. Caddy version (caddy version):

v2.4.5

2. How I run Caddy:

For the part of our stack relevant to this inquiry:

  • two servers (Digital Ocean VMs), one in SF, one in NY, each running an instance of Caddy
  • a shared mounted volume (using s3fs) for certificate storage
  • Cloudflare is balancing requests between the two servers at the DNS level (not “orange cloud” proxied, just regular “grey cloud” DNS randomly distributing incoming requests between the two servers)

a. System environment:

Linux Ubuntu 20.04.3 LTS

b. Command:

service caddy start

c. Service/unit/compose file:

/etc/systemd/system/caddy.service.d/override.conf

[Service]
ExecStart=
ExecStart=/usr/bin/caddy run --environ --config /var/lib/caddy/.config/caddy/autosave.json
ExecReload=
ExecReload=/usr/bin/caddy reload --config /var/lib/caddy/.config/caddy/autosave.json

d. My complete Caddyfile or JSON config:

Routes omitted; they are not relevant to this issue:

{
  "admin": {
    "disabled": false,
    "listen": ":2019",
    "origins": [
      "localhost:2019"
    ]
  },
  "logging": {
    "logs": {
      "caddyLogs": {
        "writer": {
            "output": "file",
            "filename": "/var/log/caddy/caddy.log"
        }
      }
    }
  },
  "apps": {
    "http": {
      "servers": {
        "main" : {
          "listen" : [
            ":443"
          ],
          "logs": {},
          "routes" : []
        }
      }
    },
    "tls": {
      "automation": {
        "policies": [
          {
            "issuers": [
              {
                "module": "zerossl",
                "api_key": "omitted",
                "email": "omitted"
              },
              {
                "module": "acme",
                "email": "omitted"
              }
            ]
          }
        ]
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "/caddy-storage-mount"
  }
}

3. The problem I’m having:

TLDR:

When I add a new route to Caddy on both servers, I check the certificates folder in a loop, waiting until the certificate files are downloaded (already hacky). Once the certs are in place, I have to do a service caddy restart on the server that did not acquire the cert, which results in 5-8 seconds of downtime. I am hoping to find a way to do a zero-downtime service caddy reload, or some other way for both servers to “know” about the new certs.

Longer story:

Here is the sequence of events:

  1. send a POST request to the Caddy API on both servers to add a new route (a custom domain)
  2. a code loop starts in our API, checking for the existence of the certificate files in the shared mounted volume every 3 seconds
  3. the SF server starts the process of acquiring the certs from ZeroSSL as expected
  4. the NY server also tries to get the cert, but gives up when it encounters a {"level":"error","ts":1644478276.106161,"logger":"tls","msg":"job failed","error":"eight.greenbongo.com: obtaining certificate: unable to acquire lock 'issue_cert_eight.greenbongo.com': decoding lockfile contents: EOF"} error (which is, I assume, how a Caddy instance knows that another Caddy instance has “dibs” on acquiring the certs?)
  5. SF acquires the certs from ZeroSSL as expected and downloads them to the shared mounted volume
  6. at this point, any requests that are routed to the SF server work great (without any reloads or restarts)
  7. however, requests going to the NY server get a nasty SSL error
  8. service caddy reload on the NY server doesn’t change anything, still the SSL error
  9. after a service caddy restart, the NY server “knows” about the new certs and everything works great; however, service caddy restart results in 5-8 seconds of downtime for all requests going to that server, negatively impacting the UX for users on that server
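The polling step described above (step 2) can be sketched as a small shell function. This is only an illustration of the workaround, not part of our actual code; the path layout assumes Caddy's default file_system storage (certificates/<issuer>/<domain>/<domain>.crt under the storage root), and the retry counts are made up:

```shell
# wait_for_cert STORAGE_ROOT DOMAIN [MAX_TRIES] [INTERVAL_SECONDS]
# Poll the shared cert storage until the certificate file for DOMAIN appears.
# Returns 0 once the .crt file exists, 1 on timeout.
wait_for_cert() {
  root=$1; domain=$2; tries=${3:-40}; interval=${4:-3}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Caddy's file_system storage keeps certs under
    # <root>/certificates/<issuer>/<domain>/<domain>.crt
    if ls "$root"/certificates/*/"$domain"/"$domain".crt >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

Once this returns successfully, the server that did not acquire the cert still needs to be told to pick it up, which is the downtime problem discussed below.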

It’s hard to believe this is how Caddy is supposed to work in a clustered environment; there must be a better way to configure a Caddy cluster that I could not find in the documentation or anywhere else.

4. Error messages and/or full log output:

This log is after adding the new route to Caddy via Caddy API:

(the SF server is called blueprint-sfo3-lightgray-grasshopper, the NY one is blueprint-nyc3-sandybrown-duck, please note which logs are from which servers)

Feb 9 23:30:59 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478259.3067803,"logger":"tls.obtain","msg":"acquiring lock","identifier":"eight.greenbongo.com"}

Feb 9 23:31:01 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478261.0714962,"logger":"tls.obtain","msg":"lock acquired","identifier":"eight.greenbongo.com"}

Feb 9 23:31:01 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478261.39716,"logger":"tls.issuance.acme","msg":"waiting on internal rate limiter","identifiers":["eight.greenbongo.com"],"ca":"https://acme.zerossl.com/v2/DV90","account":"omitted"}

Feb 9 23:31:01 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478261.3975992,"logger":"tls.issuance.acme","msg":"done waiting on internal rate limiter","identifiers":["eight.greenbongo.com"],"ca":"https://acme.zerossl.com/v2/DV90","account":"omitted"}

Feb 9 23:31:09 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644478269.9866645,"logger":"tls.obtain","msg":"acquiring lock","identifier":"eight.greenbongo.com"}

Feb 9 23:31:16 blueprint-nyc3-sandybrown-duck caddy.log error {"level":"error","ts":1644478276.106161,"logger":"tls","msg":"job failed","error":"eight.greenbongo.com: obtaining certificate: unable to acquire lock 'issue_cert_eight.greenbongo.com': decoding lockfile contents: EOF"}

Feb 9 23:31:18 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478278.252083,"logger":"tls.issuance.acme.acme_client","msg":"trying to solve challenge","identifier":"eight.greenbongo.com","challenge_type":"http-01","ca":"https://acme.zerossl.com/v2/DV90"}

Feb 9 23:31:23 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644478283.5799835,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"eight.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:58156","distributed":true}

Feb 9 23:31:55 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478315.6375792,"logger":"tls.obtain","msg":"certificate obtained successfully","identifier":"eight.greenbongo.com"}

Feb 9 23:31:55 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478315.6384492,"logger":"tls.obtain","msg":"releasing lock","identifier":"eight.greenbongo.com"}

Feb 9 23:31:59 blueprint-sfo3-lightgray-grasshopper caddy.log warn {"level":"warn","ts":1644478319.2619808,"logger":"tls","msg":"stapling OCSP","error":"no OCSP stapling for [eight.greenbongo.com]: parsing OCSP response: ocsp: error from server: unauthorized"}

At this point, the certs are properly downloaded, but only the SF server knows this. NY needs a full service caddy restart before it will use the downloaded certs.

5. What I already tried:

  • searching documentation for best practices or a tutorial specifically for a Caddy cluster
  • searching forums
  • searching the internet
  • Caddy is awesome, but I found that info on Caddy clustering is rare and vague. For example, when searching the entire documentation for “cluster”, there is only one reference, which in its entirety states: “Any Caddy instances that are configured to use the same storage will automatically share those resources and coordinate certificate management as a cluster.”

Please let me know if any additional information would be helpful; I would also be happy to set up two servers with the shared mount and give someone SSH access if requested. If I can come up with an elegant solution for Caddy clustering, without the hacky loop for checking for certs and the downtime during a restart, I would be happy to write up a tutorial or contribute to the documentation.

6. Links to relevant resources:

S3FS: https://github.com/s3fs-fuse/s3fs-fuse
Caddy2 clustering - #2 by matt
Caddy as clustered load balancer - how's that gonna work? - #15 by packeteer


You probably shouldn’t be using --config in this case; you should use the --resume flag instead, which is specifically designed for this use case. See the docs; we ship a caddy-api.service meant for this:

You don’t need to restart (and you shouldn’t, because that specifically causes downtime); you just need to force a reload, which you can do with the --force flag of the caddy reload command. By default, reloading does nothing unless the config is actually different from the currently running one.

The other option to alleviate this would be to use On-Demand TLS, which would mean that Caddy would check storage during the TLS handshake to try to find the certificate (and attempt to issue one if none is found). If you turn on On-Demand TLS, you’ll want to set up an ask endpoint to limit the domains for which Caddy will try to issue a cert.
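A sketch of what the tls app portion of the JSON config might look like with On-Demand TLS enabled (the ask URL is a placeholder for your own endpoint, and this would be merged into the existing automation block from the config above):

```json
{
  "apps": {
    "tls": {
      "automation": {
        "on_demand": {
          "ask": "http://localhost:9123/allowed"
        },
        "policies": [
          {
            "on_demand": true,
            "issuers": [
              {
                "module": "acme",
                "email": "omitted"
              }
            ]
          }
        ]
      }
    }
  }
}
```

Caddy will send a request to the ask endpoint with the domain as a query parameter before attempting issuance; the endpoint should return HTTP 200 only for domains you actually serve.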

Hmm, let me look into this as soon as I get a chance. Might be swamped until after the weekend but Francis poked me in Slack about this so I’ll prioritize it!


Ok actually I just talked with Francis about this and we kind of think the real bug is the empty lock files (decoding lockfile contents: EOF).

Because the situation you describe should work (and AFAIK, does work for other users in a cluster). If one Caddy instance is getting a cert, the other one(s) should wait until it’s done, the lock file will go away, and then that instance will obtain the lock. Then it will verify that a cert still needs to be obtained while within the lock. If the cert already exists, it will load and use it.

I’ve tested that and I know it works, but when the lock file can’t be read because it’s empty, we can’t determine whether it’s stale or not. So if we can’t lock reliably, I can see why this problem would occur.

Ideally, the lock file should not be empty. I would like to figure out why that is happening…


Thanks so much for your replies!

Also wanted to mention, in the served key authentication logs, sometimes Caddy says distributed: true and other times distributed: false, seemingly randomly… could this be a helpful clue? Any idea why?

(I would assume it should always be distributed: true in a clustered deployment?)

Feb 9 12:45:38 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644439538.8348293,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"one.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:33176","distributed":false}

Feb 9 12:55:31 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644440131.8291857,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"leeloo.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:45272","distributed":false}

Feb 9 12:58:27 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644440307.4106588,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"bird.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:40926","distributed":false}

Feb 9 13:09:14 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644440954.6563065,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"fish.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:56026","distributed":true}

Feb 9 16:29:40 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644452980.5543103,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"four.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:47084","distributed":false}

Feb 9 16:34:17 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644453257.305045,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"five.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:48706","distributed":true}

Feb 9 16:42:03 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644453723.7032986,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"six.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:38940","distributed":true}

Feb 9 22:11:28 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644473487.6851633,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"ava.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:40406","distributed":false}

Feb 9 23:06:44 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644476804.293307,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"five.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:48224","distributed":false}

Feb 9 23:31:23 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644478283.5799835,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"eight.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:58156","distributed":true}

Awesome tip @francislavoie, this works great, and solves the UX problem!!

Caddy team rocks!!!

:partying_face: :tada: :purple_heart: :unicorn: :rainbow: :pray: :100: :star2: :bulb:


You probably shouldn’t be using --config in this case, you should use the --resume flag instead,

@francislavoie what is the difference between the two? (could not find in docs)

From the command line docs:

  • --config specifies the config file to use.
  • --resume tells Caddy to resume from the last-loaded config (even if modified by the API). If you’re using the API, you usually won’t have a config file since the API is making live updates to your config, rendering the file obsolete.

@matt should I open an issue about this on Github?

That value is false if the ACME challenge info (needed to solve the challenge) was found in memory by the local instance; i.e. the same instance started that challenge. It is true if it had to load the challenge data from storage; i.e. a different instance started the challenge.

Probably not; turns out NFS is a buggy file system: What is going on with AWS EFS? · Issue #169 · caddyserver/certmagic · GitHub

As I interpret the docs, if you are running Caddy with the config specified as an autosave.json file:

[Service]
ExecStart=
ExecStart=/usr/bin/caddy run --environ --config /var/lib/caddy/.config/caddy/autosave.json
ExecReload=
ExecReload=/usr/bin/caddy reload --config /var/lib/caddy/.config/caddy/autosave.json

then --config and --resume would do the same thing, since all API updates are persisted to the autosave.json file, right?

Kinda, but --resume is the built-in way to do that, making your config not depend on where the autosave file is stored.

Yeah, like Francis said, I think that’s redundant. You only need --resume if you’re using the API for changes.

  1. Do you mean my override file should be:
[Service]
ExecStart=
ExecStart=/usr/bin/caddy run --environ --config /var/lib/caddy/.config/caddy/autosave.json
ExecReload=
ExecReload=/usr/bin/caddy reload --resume

and when I execute service caddy reload --force, Caddy will always load the config from /var/lib/caddy/.config/caddy/autosave.json?

  2. Also, is --resume:

A. more performant/faster?

or

B. just a preferred convention?

In your test, can you describe your setup? Specifically, what did you use for shared cert storage? An NFS mount? An s3fs mount? Or something else?

I am not using NFS, so my problem is not related to the NFS bug; I’m using s3fs (which uses FUSE under the hood), and I don’t think it’s likely that NFS and FUSE share the same bug.

To recap:

  1. Two servers with Caddy sharing cert storage, server A fetches the cert, server B logs an EOF error on the lockfile.
  2. Requests to server B for the domain with the newly downloaded cert return an SSL error
  3. After executing caddy reload --force on server B, it now “knows” the new certs exist and will serve the site securely as expected.

It would be great to figure out a way to solve this, so if valid certs exist, Caddy will serve a domain securely instead of throwing an SSL error. (And not needing a hacky loop and a reload.)

I would be happy to set up a couple of servers if you’d like.

No, --resume is a flag for caddy run, not caddy reload. See the docs:

You probably don’t need caddy reload at all, because any change to the config is a reload anyways.

The caddy-api.service file we ship doesn’t have ExecReload at all, because it doesn’t really make sense for the API use case:
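Following that advice, the override file from earlier in the thread would presumably become something like this (a sketch; the --environ flag is kept from the original, and ExecReload is dropped entirely):

```ini
[Service]
ExecStart=
ExecStart=/usr/bin/caddy run --environ --resume
```

With --resume, Caddy restores the last config it had loaded (including API changes persisted to autosave.json) without the unit file needing to hardcode where the autosave file lives.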

If you do need to force reload then you can use the /load endpoint with the Cache-Control: must-revalidate header to force it (which is what --force flag on the caddy reload command does anyways). This is documented here:
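For example, assuming the admin API is listening on the default localhost:2019 and re-posting the autosave file from earlier in the thread, a forced reload via the API might look like:

```shell
curl "http://localhost:2019/load" \
  -H "Content-Type: application/json" \
  -H "Cache-Control: must-revalidate" \
  --data-binary @/var/lib/caddy/.config/caddy/autosave.json
```

The Cache-Control: must-revalidate header tells Caddy to apply the config even if it is identical to the currently running one.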

It’s just simpler to use altogether. Same performance.

S3 is not a filesystem, and it’s not safe to use when locks are needed for consistency: S3 is eventually consistent. Anything built on top of S3 isn’t going to be able to solve that problem.

@francislavoie thanks for the info.

If you were designing a simple Caddy deployment for a production environment (2 servers in two regions would be sufficient to eliminate a SPOF), what shared storage option for TLS certs would you use?

(I would like to avoid setting up a cross-region Redis cluster; that’s a whole new load of infrastructure to set up and maintain just for shared storage. It also seems Redis is not “officially” supported; it’s a community contribution with fewer than 100 stars. So what is the simplest solution you would recommend?)

Honestly, I don’t have a good answer for that.

The Redis plugin is very stable, FYI. It’s really simple, because Redis is simple and reliable.

Maybe Consul, if not Redis.

Maybe something like GlusterFS or Ceph might work for you. But I’ve never deployed those myself.

There is this Postgres storage backend that looks like it could do the job: GitHub - yroc92/postgres-storage


I have configured a two server test environment where you can replicate the Caddy issue, details are here: https://www.notion.so/platformpurple/Caddy-cluster-project-6c46110e3bf04fde8e4eb44c79559912

Please try it. Since Caddy can read and write fine from the shared mount, it seems this issue can be fixed, hopefully without too much effort!
