1. Caddy version (`caddy version`):
v2.4.5
2. How I run Caddy:
For the part of our stack relevant to this inquiry:
- two servers (DigitalOcean VMs), one in SF, one in NY, each running an instance of Caddy
- a shared mounted volume (using `s3fs`, mounted roughly as sketched below) for certificate storage
- Cloudflare balancing requests between the two servers at the DNS level (not “orange cloud” proxied, just regular “grey cloud” DNS randomly distributing incoming requests between the two servers)
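For context, the shared volume is mounted with something along these lines (the bucket name, region endpoint, and credentials path are placeholders, not our real values):

```bash
# Hypothetical sketch of the shared certificate mount on each server.
# "caddy-certs" and the sfo3 endpoint are placeholders for our real bucket.
s3fs caddy-certs /caddy-storage-mount \
  -o url=https://sfo3.digitaloceanspaces.com \
  -o passwd_file=/etc/passwd-s3fs \
  -o allow_other   # so the caddy user can read/write the mount
```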
a. System environment:
Linux Ubuntu 20.04.3 LTS
b. Command:
`service caddy start`
c. Service/unit/compose file:
`/etc/systemd/system/caddy.service.d/override.conf`:
```
[Service]
ExecStart=
ExecStart=/usr/bin/caddy run --environ --config /var/lib/caddy/.config/caddy/autosave.json
ExecReload=
ExecReload=/usr/bin/caddy reload --config /var/lib/caddy/.config/caddy/autosave.json
```
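For completeness, the override gets picked up with the usual systemd steps, after which config changes are meant to go through the graceful reload:

```bash
# Apply the override once:
sudo systemctl daemon-reload
sudo systemctl restart caddy

# After that, config changes go through the graceful, zero-downtime reload
# (same as `service caddy reload`; runs the ExecReload line above):
sudo systemctl reload caddy
```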
d. My complete Caddyfile or JSON config:
(routes omitted; they aren't relevant to this issue)
```json
{
  "admin": {
    "disabled": false,
    "listen": ":2019",
    "origins": [
      "localhost:2019"
    ]
  },
  "logging": {
    "logs": {
      "caddyLogs": {
        "writer": {
          "output": "file",
          "filename": "/var/log/caddy/caddy.log"
        }
      }
    }
  },
  "apps": {
    "http": {
      "servers": {
        "main": {
          "listen": [
            ":443"
          ],
          "logs": {},
          "routes": []
        }
      }
    },
    "tls": {
      "automation": {
        "policies": [
          {
            "issuers": [
              {
                "module": "zerossl",
                "api_key": "omitted",
                "email": "omitted"
              },
              {
                "module": "acme",
                "email": "omitted"
              }
            ]
          }
        ]
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "/caddy-storage-mount"
  }
}
```
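For reference, this is roughly how the config gets onto each instance (assuming it's saved locally as `config.json`; `localhost:2019` matches the `admin` block above, and Caddy then persists the result to the `autosave.json` referenced in the override):

```bash
# Push the full JSON config to the local admin API.
curl -sS localhost:2019/load \
  -H "Content-Type: application/json" \
  -d @config.json
```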
3. The problem I’m having:
TLDR:
When I add a new route to Caddy on both servers, I check the certificates folder in a loop to wait until the certificate files are downloaded (already hacky). Once the certs are in place, I have to do a `service caddy restart` on the server that did not acquire the cert, which results in 5-8 seconds of downtime. I am hoping to find a way to do a zero-downtime `service caddy reload`, or some other way for both servers to “know” about the new certs.
Longer story:
Here is the sequence of events (see the sketch after this list):
- send a POST request to the Caddy API on both servers to add a new route (a custom domain)
- a code loop starts in our API, checking for the existence of the certificate files in the shared mounted volume every 3 seconds
- the SF server starts the process of acquiring the certs from ZeroSSL as expected
- the NY server also tries to get the cert, but gives up when it encounters the following error (which is, I assume, how a Caddy instance knows that another Caddy instance has “dibs” on acquiring the certs?):
`{"level":"error","ts":1644478276.106161,"logger":"tls","msg":"job failed","error":"eight.greenbongo.com: obtaining certificate: unable to acquire lock 'issue_cert_eight.greenbongo.com': decoding lockfile contents: EOF"}`
- SF acquires the certs from ZeroSSL as expected and downloads them to the shared mounted volume
- at this point, any requests that are routed to the SF server work great (without any reloads or restarts)
- however, requests going to the NY server get a nasty SSL error
- `service caddy reload` on the NY server doesn’t change anything; still the SSL error
- after a `service caddy restart`, the NY server “knows” about the new certs and everything works great, but `service caddy restart` results in 5-8 seconds of downtime for all requests going to that server, negatively impacting the UX for users on that server
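To make that concrete, the route-add and the cert-polling look roughly like this (the route body is a stand-in for our real one, and the issuer directory name is my guess; it is derived from the CA URL, so it may differ):

```bash
DOMAIN=eight.greenbongo.com

# 1) Append a new route on each server (POSTing to an array path appends).
#    static_response here is illustrative, not our actual handler.
curl -sS "localhost:2019/config/apps/http/servers/main/routes" \
  -H "Content-Type: application/json" \
  -d "{\"match\":[{\"host\":[\"$DOMAIN\"]}],\"handle\":[{\"handler\":\"static_response\",\"body\":\"ok\"}]}"

# 2) Poll the shared mount every 3 seconds until the cert files appear.
#    NOTE: "acme.zerossl.com-v2-dv90" is an assumed issuer directory name.
CERT_DIR="/caddy-storage-mount/certificates/acme.zerossl.com-v2-dv90/$DOMAIN"
until [ -f "$CERT_DIR/$DOMAIN.crt" ] && [ -f "$CERT_DIR/$DOMAIN.key" ]; do
  sleep 3
done

# 3) On the server that did NOT win the lock, only a full restart helps:
sudo service caddy restart   # ~5-8 seconds of downtime
```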
It’s hard to believe this is how Caddy is supposed to work in a clustered environment, so there must be a better way to configure a Caddy cluster; I just couldn’t find it in the documentation or anywhere else.
4. Error messages and/or full log output:
This log is from after adding the new route via the Caddy API (the SF server is called `blueprint-sfo3-lightgray-grasshopper`, the NY one is `blueprint-nyc3-sandybrown-duck`; please note which logs are from which server):
```
Feb 9 23:30:59 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478259.3067803,"logger":"tls.obtain","msg":"acquiring lock","identifier":"eight.greenbongo.com"}
Feb 9 23:31:01 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478261.0714962,"logger":"tls.obtain","msg":"lock acquired","identifier":"eight.greenbongo.com"}
Feb 9 23:31:01 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478261.39716,"logger":"tls.issuance.acme","msg":"waiting on internal rate limiter","identifiers":["eight.greenbongo.com"],"ca":"https://acme.zerossl.com/v2/DV90","account":"omitted"}
Feb 9 23:31:01 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478261.3975992,"logger":"tls.issuance.acme","msg":"done waiting on internal rate limiter","identifiers":["eight.greenbongo.com"],"ca":"https://acme.zerossl.com/v2/DV90","account":"omitted"}
Feb 9 23:31:09 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644478269.9866645,"logger":"tls.obtain","msg":"acquiring lock","identifier":"eight.greenbongo.com"}
Feb 9 23:31:16 blueprint-nyc3-sandybrown-duck caddy.log error {"level":"error","ts":1644478276.106161,"logger":"tls","msg":"job failed","error":"eight.greenbongo.com: obtaining certificate: unable to acquire lock 'issue_cert_eight.greenbongo.com': decoding lockfile contents: EOF"}
Feb 9 23:31:18 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478278.252083,"logger":"tls.issuance.acme.acme_client","msg":"trying to solve challenge","identifier":"eight.greenbongo.com","challenge_type":"http-01","ca":"https://acme.zerossl.com/v2/DV90"}
Feb 9 23:31:23 blueprint-nyc3-sandybrown-duck caddy.log info {"level":"info","ts":1644478283.5799835,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"eight.greenbongo.com","challenge":"http-01","remote":"91.199.212.132:58156","distributed":true}
Feb 9 23:31:55 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478315.6375792,"logger":"tls.obtain","msg":"certificate obtained successfully","identifier":"eight.greenbongo.com"}
Feb 9 23:31:55 blueprint-sfo3-lightgray-grasshopper caddy.log info {"level":"info","ts":1644478315.6384492,"logger":"tls.obtain","msg":"releasing lock","identifier":"eight.greenbongo.com"}
Feb 9 23:31:59 blueprint-sfo3-lightgray-grasshopper caddy.log warn {"level":"warn","ts":1644478319.2619808,"logger":"tls","msg":"stapling OCSP","error":"no OCSP stapling for [eight.greenbongo.com]: parsing OCSP response: ocsp: error from server: unauthorized"}
```
At this point, the certs are properly downloaded, but only the SF server knows this. NY needs a full `service caddy restart` before it will use the downloaded certs.
5. What I already tried:
- searching documentation for best practices or a tutorial specifically for a Caddy cluster
- searching forums
- searching the internet
- Caddy is awesome, but info on Caddy clustering is rare and vague. For example, searching the entire documentation for “cluster” yields only one reference, which in its entirety states: “Any Caddy instances that are configured to use the same storage will automatically share those resources and coordinate certificate management as a cluster.”
Please let me know if any additional information would be helpful. I would also be happy to set up two servers with the shared mount and give someone SSH access on request. If I can come up with an elegant solution for Caddy clustering, without the hacky loop checking for certs and without the downtime during a restart, I would be happy to write up a tutorial or contribute to the documentation.
6. Links to relevant resources:
S3FS: https://github.com/s3fs-fuse/s3fs-fuse
Caddy2 clustering - #2 by matt
Caddy as clustered load balancer - how's that gonna work? - #15 by packeteer