Millions of domains across multiple servers with on-demand TLS?

1. The problem I’m having:

I need to serve millions of customer domains over HTTPS across multiple servers. I currently use a custom NGINX setup with Let's Encrypt certs obtained via both HTTP and DNS ACME challenges; once a cert is provisioned, I add the domain to my NGINX config. This currently works for 250k domains on one hefty server, but I'm running into slow startup times and heavy memory usage. That's obviously not ideal and won't scale.

Caddy's on-demand TLS seems like a great fit, along with shared cert storage. I have read the real-world use cases of Fathom and Oh Dear, and other relevant threads linked below. My questions are:

  1. Do you know of anybody serving 1M+ domains with Caddy on-demand TLS?
  2. Any new changes or best practices since the 2022 thread?
  3. I have the upgraded rate limits for Let's Encrypt. Are there any advantages or disadvantages to using ZeroSSL for this? I am not familiar with them.

Thanks for your help.



Welcome!

Yes, sounds like on-demand TLS is what you need. Caddy can do this very well for you at both vertical and horizontal scales.

Caddy limits how many certificates stay in memory at any given time. The default is 10,000. So if you’re serving a million domains, the certs will cycle in and out of memory (somewhat randomly). The cert decoding (when loading into memory) is expensive, but this is minimized because Caddy is smart and will keep on-demand certificates in the cache through config reloads instead of purging the cache every time (like it used to). This means you can change the config quite often and it’s very lightweight.
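For reference, a minimal on-demand TLS setup looks something like this in a Caddyfile (the `ask` URL is a placeholder for your own domain-approval service):

```caddyfile
{
	on_demand_tls {
		# Before issuing a cert, Caddy sends GET ...?domain=<name>
		# to this endpoint; a 2xx response approves issuance.
		ask http://localhost:5555/check
	}
}

https:// {
	tls {
		on_demand
	}
	reverse_proxy localhost:8080
}
```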

So your CPU overhead related to lots of certs basically comes down to how often the certs in the cache are being swapped out. Random eviction is easiest and simplest, so that’s what I’ve implemented, but if a large integrator can show with some profiling (and I can help with this) that it’s a significant cause of CPU bottlenecks, then with a sponsorship I can implement smarter cache eviction like LRU.

So anyway, you may find that a single machine performs well enough for your needs.

If not, or your architecture requires that your TLS termination is spread across multiple machines, you will need a cluster. This is as simple as setting all the Caddy instances to have the same storage backend. By default, this is the local file system, but for a cluster, you need some sort of shared storage that supports atomic operations. There are many storage modules available.

The performance of your TLS clustering depends heavily on the performance of the storage backend. For example:

  • The local file system supports atomic operations well (through O_EXCL, for instance).
  • NFS mounts can have difficulty with race conditions due to bugs in NFS implementation(s). We have seen them be less than reliable at scale, though also independently of Caddy.
  • Amazon S3 does not provide sufficient atomicity, so it is not a suitable storage backend (and it is very slow!)
  • I’ve heard that the Redis storage backend does a good job.
  • I’ve also heard that a SQL database server such as MySQL or PostgreSQL could do well!
  • Several of our sponsors use DynamoDB, however, this can get expensive quickly so you’ll need to tune your config. It can also be problematic across regions, so I know that one of our large sponsors, for example, has implemented their own caching of DynamoDB across regions.

IMO, running your own Redis/SQL instance is a nice, simple option.
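As an illustration, with a third-party Redis storage module compiled into Caddy (via xcaddy), clustering amounts to giving every instance the same global `storage` config. The exact option names vary by module, so treat this as a sketch:

```caddyfile
{
	# All Caddy instances in the cluster point at the same Redis,
	# so certs, keys, and ACME locks are shared automatically.
	storage redis {
		host 127.0.0.1
		port 6379
	}
}
```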

Roughly, yes. I don’t know of anyone at a higher order of magnitude, but several are at about this scale. I don’t have exact numbers (no telemetry, and market conditions change), but I know we have several at “250k” and one at “50k/week” (50k more per week? I dunno for sure) – but scaling to 1M is about the same, you just need more disk space.

I can help you get more specific optimizations/advice for your setup with a sponsorship – let’s chat in private about that.

ZeroSSL is an excellent choice for some; here are my thoughts without knowing your details right now. ZeroSSL has two ways of getting certificates: their ACME endpoint and their API.

Their ACME endpoint is not fast, but at a scale like yours I recommend it at least as a backup, since they don’t have tight rate limits like LE (which you have an exception for anyway). They do limit how many labels/subdomains you can have, to mitigate spam, but overall they’re a fine option.

Their API is a proprietary way of getting certificates that is much faster and allows management through their account UI, but this requires a paid subscription. At your scale you’d want to talk to them about pricing.

That all said, at your scale I definitely do recommend having at least 2-3 CAs to choose from. Caddy can automatically choose them randomly or as backups to give you the highest reliability of any solution.

Common choices are (in whatever order, or no order at all): Let’s Encrypt, ZeroSSL, and Google Trust Services. See more ACME CAs – some may require payment. I think Google Trust Services is free but has some limits.
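Configuring multiple issuers as fallbacks can be done with repeated `cert_issuer` global options; issuers are tried in order. A sketch (the ZeroSSL API key is a placeholder, and ZeroSSL's ACME endpoint can alternatively be used via the `acme` issuer with its directory URL):

```caddyfile
{
	# Tried in order: Let's Encrypt first, ZeroSSL as fallback.
	cert_issuer acme
	cert_issuer zerossl YOUR_API_KEY
}
```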

Slow startup won’t be an issue with on-demand TLS, since cert loading is deferred to handshake time.

You’ll still want a fast multicore CPU to make that as fast as possible, but we’re talking milliseconds of actual server startup instead of … I dunno, minutes?


Great, thank you so much @matt for the detailed and thoughtful response, and we’ll definitely be getting a sponsorship once we get going!

I think I’ll start with two smaller web servers so I can jump right into the clustering and not have to figure it out later. For storage, we already run MySQL and a hefty Redis/Elasticache instance at AWS, so those seem like good options worth trying. I’m curious to see if just MySQL would be enough and how response time looks for certs that aren’t in memory. Maybe we could persist the certs from Redis to MySQL, so the shared storage stays fast but we can rebuild it if the Redis instance fails. Nice bit of premature optimization there :).

Having multiple CAs sounds great – fantastic feature. I’m sure I’ll have more questions once I get going.


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.