Clustering Caddy

Hi folks. I’m looking to start using Caddy in our environment so we can switch all our customers’ custom domains over to SSL without using thousands of IP addresses and updating all those certificates (!)

So my plan is to have our existing Cisco load balancer offload the majority of our heavy SSL, since it has dedicated hardware to help with that, and then load-balance all the other traffic to two Caddy servers. These in turn will be set up as transparent proxies to our Varnish cache servers, which also do all sorts of crazy logic to work out which backend servers to send traffic to.
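
Roughly, I’m picturing each Caddy node running something like this (the hostnames, port, and max_certs number are placeholders I’ve made up, and I haven’t tested any of it yet):

:443 {
  tls {
    max_certs 100
  }
  proxy / varnish1.internal:6081 varnish2.internal:6081 {
    transparent
    policy round_robin
  }
}

i.e. on-demand certificates for whichever customer domain comes in, then pass the request straight through to Varnish with the original Host header intact.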

So, to cut a long story short: would I have to replicate the certificates to each node? Otherwise, the first time a node sees a domain, it will try to fetch a new cert from Let’s Encrypt, right? What will LE do if it gets at least two requests for each cert?

Can I even just replicate the files in /opt/caddy/ssl/acme/acme-v01.api.letsencrypt.org/sites/ to each node?

Thanks folks, Caddy could go a long way towards solving a LOT of our problems :slight_smile:


UPDATE: This is a very popular topic in search results, but it’s also over 3 years old. Caddy now works very well in a cluster to coordinate certificate management and share assets. Please refer to the latest official documentation and relevant wiki articles for information.


Hey Mark, I’ve got good news and bad news, and more good news.

Good news: Caddy’s TLS asset storage is designed to be pluggable, meaning you can plug in a TLS storage provider that takes care of the replication and syncing between Caddy instances. This is especially useful if the storage is a shared resource.

Bad news: It’s not fully developed yet; you’ll have to change some of the code and compile Caddy yourself, after writing the storage plugin you want.

More good news though: In a little while, we’ll be launching a subscription product for companies who rely on Caddy to ensure their features get considered first and get their bugs fixed before others, as well as guaranteed continued development and private support.

You could do this, but replication may not be enough. Once Caddy loads a certificate, it will try to manage it, including renewing it. You don’t want each node doing that independently.

Yes. So, replication could solve that problem, but…

It will count against your rate limit. Which is bad. If you only replicate, each Caddy will attempt renewals, instead of just one of them. This is why the storage plugin is necessary: it coordinates management of the TLS assets too.


Cool, thanks for the info.

Maybe I can get the load balancer to decide the Caddy node based on hostname instead for now. That way no domain should be visible to two Caddy nodes, apart from during occasional outages etc. It depends on whether the load balancer can see the SNI value in the TLS handshake or not…

Perhaps I could build some sort of custom directive so I can set a Caddy node to do_not_renew_certs. That would achieve a sort of master/slave cluster arrangement.

Time to grab the code and start rummaging about :slight_smile: I already want to create a custom log format that outputs something I can send to Elasticsearch.

You could pull a really cheeky tls [cert] [key], where cert and key point to the location you replicate your ACME certificates over from your “master” Caddy…

Aren’t cert and key individual files, though? Looking at the docs, I guess the load option might let it point to a central store. Although that doesn’t solve the problem of all nodes trying to renew the same certs as they each realise renewal is due.

dir is a directory from which to load certificates and keys. The entire directory and its subfolders will be walked in search of .pem files. Each .pem file must contain the PEM-encoded certificate (chain) and key blocks, concatenated together.

You can guess the directory and file naming based off the name of the vhost, though. Just a janky way of serving via HTTPS without having Caddy try to manage them.
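
For example, something like this per site, if the on-disk layout under the folder you mentioned is sites/<domain>/<domain>.crt and <domain>.key (worth double-checking the exact file names on your own install; customer1.example.com is just a stand-in):

customer1.example.com {
  tls /opt/caddy/ssl/acme/acme-v01.api.letsencrypt.org/sites/customer1.example.com/customer1.example.com.crt /opt/caddy/ssl/acme/acme-v01.api.letsencrypt.org/sites/customer1.example.com/customer1.example.com.key
}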

I’ve been thinking about this bit, and was wondering: when Caddy starts up, does it read the certificate from storage and then just cache its expiry date, so it can try to auto-renew some time before that? Does it ever reload the cert from disk to re-check the expiry date?

Just looking through renew.go: if RenewDurationBefore were configurable, and there were a way to flush the in-memory certificate cache (with a USR1 signal?), then it would be possible to give one cluster node a RenewDurationBefore a couple of days longer than the rest, so it renews any pending certs before the others would, and then all nodes get a “reload” message.

Using healthcheck URLs, I could even perform a full rolling restart of all caddy nodes every night if I need to.

I should clarify that Caddy will only try to renew certificates that it is managing. BYOC (bring your own cert) is different: like other web servers, it’s then all up to you to renew the cert and reload the server yourself.

To answer your question for managed certificates: Caddy will load the certificate once and then just run expiry scans every 12 hours. It doesn’t reload a certificate once it has been loaded.

At that point it might be both easier and more reliable to just write a TLS storage plugin to do it correctly… Doing the time-offset thing per-instance is a little hacky/scary to me. :slight_smile:

You’re probably right. Alas, that’s somewhat out of my league - Go isn’t a language I’m particularly familiar with.

I just had a thought. What if I have a designated master Caddy server configured like I do now, with automatic certificates etc.:

tls on
tls {
  max_certs 10
}

And then I mirror the /opt/caddy/ssl/acme/acme-v01.api.letsencrypt.org/sites folder onto all other nodes, and instead of running them with auto TLS, I specify a load path?

tls on
tls {
  load /opt/caddy/ssl-slave
}

(tls on isn’t a thing – that won’t work, take it out)

This should probably work, as long as the instances started with a load path are signaled to reload after the one instance of Caddy renews certificates. Let me know if it does!

Haha, that probably ended up in there after I was testing with tls off, so I just turned it on. I’ll take it out.

I’ll let you know how it goes with the load - it might make for a fairly simple master/slave possibility.
I was thinking of using something simple like rsync to copy the files around.

Perhaps something with cron, or maybe inotify on the master to kick off a sync:

rsync -avz /directory /target
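# (both paths above are placeholders: the source would be the master’s certificate folder, the target the matching path on each slave)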

And then on the slaves, either every hour or whenever, do a reload:

pkill -USR1 caddy
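# USR1 asks the running Caddy for a graceful reload, which should also pick up the freshly synced certs loaded from disk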

I could be clever and use inotify to reload on changes, but I’m a bit wary of initiating a cascade of reloads if a bunch of files change at once.

In reality, new certs will only come online fairly infrequently, and won’t need to be immediately accessible, and renewals should have many days of leeway. Even a daily sync would probably do.

Handily solved if, instead of kicking off the reload immediately, inotify just resets and restarts a few-second timer, which then kicks off the reload once the changes have settled.

systemd.path unit files can do that easily.

A clustered system would need to be designed in a way that accounts for failure of any cluster member.

Go that way and you will arrive at a distributed webserver. If you don’t want to make synchronization and so on part of it, you’d use a database such as CoreOS’ etcd.

A distributed webserver, even just a »clustered one«, needs far more than a shared certificate storage and some kind of locking to prevent two members renewing the same certificate. For example:

  • synchronize any TLS state data (or design for shared-nothing)
  • synchronize configs
  • internal routing to backends (CGI)
  • internal routing to assets, or a separate storage layer (NFS), or something better (hyperconverged design)

I no longer contribute to Caddy, and one reason is that I am writing such an email and webserver myself. But competitors, for example Nginx, are moving in that direction too, though they settled on a leader-follower (formerly called master-slave) design. Wait a few months and such webserver designs will be the new mainstream.


This gets a little trickier actually, when you consider that behind a load balancer, only one node will get the pingback from Let’s Encrypt, depending on how it decides which Caddy server to send the request to.

It’s almost like one node needs to make the renewal request, but then all nodes are able to deal with the callback from LE.

For now, until minds smarter than mine build something clever, I may have to settle for a single node with a hot-standby. If I occasionally have to re-request some certs when the node changes, I don’t think that will be a problem.

You’re right, but by using the DNS challenge, no node needs to handle any sort of request or handshake. It just has to be synchronized so that only one node initiates the challenge; it sets the record with the DNS provider, waits for it to be verified, then clears the record when done, all without hitting the local network.
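
For reference, with a build of Caddy that includes the relevant DNS provider plugin and the provider’s credentials in environment variables, it looks roughly like this (cloudflare is just an example provider):

tls {
  dns cloudflare
}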

Alas, we are hosting some web resources for our clients on custom domains, and can’t be monkeying with their DNS.

Ah, I forgot that detail. Yeah, doing the other two challenges distributed is tricky, like you said. Maybe someday (it’s on my long-term list) though!