Scaling Caddy to hundreds of thousands of domains

1. Caddy version (caddy version):

v2.5.1 h1:bAWwslD1jNeCzDa+jDCNwb8M3UJ2tPa8UZFFzPVmGKs=

2. How I run Caddy:

a. System environment:

Ubuntu 22.04

b. Command:

caddy run 2> error.txt 1> access.txt

c. Service/unit/compose file:

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target

d. My complete Caddyfile or JSON config:

{
	email <redacted>
	acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
	on_demand_tls {
		ask http://localhost:5555/check
	}
}

https:// {
	tls {
		on_demand
	}
	root * /var/www/html
	file_server
	log {
		output stdout
	}
}

3. The problem I’m having:

First of all, I would like to mention that I really enjoy working with Caddy. The simplicity of the config, as well as the active community, makes it a blast to work with. I am currently evaluating whether Caddy is a viable solution for managing TLS for hundreds of thousands of domains, and to that end I would like to extend my knowledge of Caddy. Many questions have already been answered in previous posts, but for some of the following topics I could not find anything.

Consider the following use case: I would like to manage TLS for 500’000+ domains via Caddy. I also have an account with Let’s Encrypt that supports increased rate limits. Ideally, the certificates for the domains should be issued, as well as managed, on demand. My goal is to make the service as robust and performant as possible within my raised Let’s Encrypt rate limits.

  • Are there any best practices you recommend when managing that many domains? (Features such as the ask endpoint, which should be used to make the service more performant and robust.)
  • I know that Caddy has an internal rate limit, which is set to 10 certificates per minute. In previous posts, you have explained that this throttle exists to avoid flooding the CA with certificate requests. Since I have an account with increased rate limits, is there any way to change this internal limit to better fit my use case (apart from modifying the source code)? And beyond Let’s Encrypt’s rate limits and the internal ones, is there anything else throttling Caddy’s performance that can be improved via the config?
  • How exactly does Caddy handle a large number of concurrent requests for certificates? From the logs, I can see that locks are acquired whenever Caddy must request a new certificate. It would be interesting to know, for example, how many threads are available to take incoming requests and what the overall performance is when handling that many requests simultaneously. (I am currently also testing the performance myself to see how it scales to a large number of requests, but it would be great if someone smarter than me could explain how the internals handle that volume of requests.)
  • With on-demand TLS active, certificates are issued and maintained whenever the need arises. Am I correct in assuming that with on-demand TLS, the renewal process is never triggered by anything other than an incoming handshake? This is important to know, because I would like to avoid something triggering the renewal of 100’000 certificates simultaneously in the background.

I would really appreciate it if someone could help me with these questions. Thanks in advance!

4. Error messages and/or full log output:

5. What I already tried:

6. Links to relevant resources:

It’s actually 10 per 10 seconds right now (effectively 60 per minute) as of this commit.

If you absolutely need to go faster than that, then you’ll need to compile a build of Caddy with this rate limit adjusted. We haven’t made that rate limit configurable yet because it’s usually enough for everyone. You might want to give your opinion on this issue Make internal rate limiting configurable · Issue #143 · caddyserver/certmagic · GitHub if you think it should be made configurable (and your point that you have an account with increased rate limits is a good one).
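
For what it’s worth, one way to build with an adjusted throttle without hand-editing Caddy itself is to fork certmagic, change the limit in your fork, and point xcaddy at it with a module replacement (the fork path and branch below are placeholders):

xcaddy build --with github.com/caddyserver/certmagic=github.com/yourname/certmagic@your-branch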

At that rate of 10 per 10 seconds (one cert per second on average), issuing 500,000 certs would take about 500,000 seconds, or roughly 139 hours, at a minimum. How long it takes in practice depends on how active each of your domains is, since with On-Demand TLS a certificate is only obtained upon the first TLS handshake for a domain. That should probably be fast enough for you to do a gradual rollout if you switch over the domains in chunks over the span of a couple weeks.

You might want to adjust the certificate cache capacity (currently only configurable via the JSON config; see the JSON Config Structure reference in the Caddy documentation), which defaults to 10,000 certificate entries; this is the number of certs Caddy will keep in memory at any given time. The cap keeps Caddy from consuming far too much RAM, but if your server has RAM to spare, raising it helps performance: it avoids the file I/O needed to reload certs from storage back into the cache, and avoids cache thrashing.
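
For reference, that knob lives in the JSON config roughly like this (50,000 is an arbitrary example value; size it to your available RAM):

{
	"apps": {
		"tls": {
			"cache": {
				"capacity": 50000
			}
		}
	}
}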

Go uses a goroutine-based concurrency model: it will use as many CPU threads as are available, but you can have hundreds of thousands of goroutines doing concurrent processing.

When using On-Demand TLS, issuance happens synchronously in the same goroutine as the request. There’s no limit to concurrency by Caddy itself, but the kernel might limit you in certain ways that you could tweak (I’m not an expert there). You can do some research on optimizing Go programs for high concurrency.

Correct. Renewal will happen during a handshake if the certificate is within the last 1/3 of its lifetime (the last 30 days of a 90-day Let’s Encrypt certificate).

I’ll defer to @matt to clarify other points in case I got anything wrong.

I strongly suggest that you consider sponsoring Caddy, especially if you’re using it for such a large deployment, so you can get prioritized support and ensure the future of the project.

Welcome to the Caddy community, Tim! Great questions.

It absolutely is – it’s been specially designed for exactly that use case. Several companies are doing this already.

I second what Francis wrote, and will clarify a few things, and add my own thoughts.

Sometimes. The only times this happens are first-time cert issuance (i.e. no certificate exists to satisfy the handshake) or when the certificate has completely expired (i.e. no usable certificate exists to satisfy the handshake). Renewals still happen in the background (but are triggered by handshakes, as opposed to a timer).

Definitely use the ask endpoint. Make sure your machines are powerful enough to decode that many certificates and have enough memory. If you are running multiple Caddy instances, have them share the same storage backend so they’ll share cert resources and coordinate management automatically.
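
To illustrate the ask endpoint, here is a minimal sketch in Go matching the http://localhost:5555/check address from the config above; the in-memory allowlist is a stand-in for whatever database actually holds your provisioned domains:

package main

import (
	"log"
	"net/http"
)

// allowed stands in for your real datastore of provisioned domains;
// in production this would be a fast database or cache lookup.
var allowed = map[string]bool{
	"example.com": true,
	"example.net": true,
}

func main() {
	// Before each on-demand issuance, Caddy requests this endpoint
	// with the hostname in the "domain" query parameter; any 2xx
	// response permits issuance, anything else denies it.
	http.HandleFunc("/check", func(w http.ResponseWriter, r *http.Request) {
		if allowed[r.URL.Query().Get("domain")] {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "unknown domain", http.StatusForbidden)
	})
	log.Fatal(http.ListenAndServe("localhost:5555", nil))
}

Since this endpoint is consulted on every issuance decision, keeping it fast (and close to Caddy) matters nearly as much as keeping it correct.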

Francis has covered this pretty well: the internal throttle is actually higher now. But let me know how we can better accommodate your use case if it’s not sufficient already.

Like Francis said, we defer you to Go’s concurrency model (and its memory model) and the tuning of your Linux kernel. (You may also find Effective Go handy.) Your server will likely be at its busiest handling normal TLS connections and HTTP requests though, not managing certificates (especially after the initial round of certs has been obtained).

In general, Go is widely accepted as being able to support millions of goroutines, as long as memory constraints permit. (One goroutine != one system thread. Many goroutines are multiplexed onto each system thread. The Go scheduler is quite smart because it understands Go code!)
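
A toy demonstration of that model (plain Go, nothing Caddy-specific):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// GOMAXPROCS(0) reports how many OS threads may execute Go code
	// at once; by default it equals the number of CPU cores.
	fmt.Println("OS threads available:", runtime.GOMAXPROCS(0))

	// Starting 100,000 goroutines is cheap (each begins with a stack
	// of just a few KB); the Go scheduler multiplexes all of them
	// onto that handful of OS threads.
	var wg sync.WaitGroup
	for i := 0; i < 100_000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// imagine one TLS handshake or HTTP request handled here
		}()
	}
	wg.Wait()
	fmt.Println("all 100,000 goroutines completed")
}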

It’s been a while since I’ve looked at that part of the code, but I believe so, yes.

100,000 handshakes at the same time all indicating different hostnames would certainly do it. But even then, as has been explained, Caddy has internal throttling to prevent stampeding herds.

💯 this – I encourage all our large adopters to get a sponsorship to ensure that Caddy runs their critical operations smoothly, and to ensure ongoing maintenance of the project, etc. Sponsorships are customizable based on your needs, too. Let’s talk about that and help you get set up with your upgraded infrastructure!

Thanks Matt and Francis, you have been an incredible help.

I am trying to design a redirect service where you can redirect domains that do not have TLS. The idea behind my current approach is that I issue a cert via a cluster of on-demand Caddys as soon as someone creates a new redirect. Because I want to avoid potential delays at handshakes, I thought about managing the renewal of the certs on a cluster of “pre-issued” Caddys. Additionally, these pre-issued Caddys would also handle the redirect.
Now ideally, the pre-issued Caddys share one config and sit behind a load balancer. This way, I can ensure that if a Caddy goes offline, the others can still manage certs and handle redirects.

Caddy already offers great support for sharing resources amongst different instances: you simply enter the storage path into the config, and you’re set. However, I could not find anything on how to achieve a shared config amongst different Caddy instances. You can run the Caddys off of the same config initially; however, if I want to add a new host and redirect target (via the API), I somehow need to propagate the change to all Caddy instances. So far, the best approach I have found is to designate a single Caddy instance to handle API calls for config changes, write the new config to a file on the shared storage, and tell all the other Caddys to reload with the new config. Alternatively, I could broadcast the API call to all the Caddy instances. However, both of these approaches seem rather convoluted and come with a lot of drawbacks. Are there best practices for handling a shared config amongst multiple instances where the config changes frequently?

Clustering Caddys does not seem to be an issue as long as the config does not change or the Caddys run on separate configs (but then clustering only really makes sense in an on-demand scenario).

I have also forwarded a request to look into a sponsorship 🙂

Just to clarify… do you actually mean redirect or do you mean reverse proxy? Those are different concepts.

A redirect is a kind of HTTP response that contains a Location header which tells the client to make a new request at a different URL.

A reverse proxy is when a server (i.e. Caddy) sends the request to be handled by some other server and passes the response back to the client.

We don’t currently have a mechanism for that. You’ll need to push the config changes to all your Caddy instances yourself.
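
As a rough sketch of what pushing it yourself can look like, assuming each instance exposes the default admin endpoint on port 2019 (the instance addresses are placeholders):

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// The new config to roll out to every instance.
	cfg, err := os.ReadFile("caddy.json")
	if err != nil {
		panic(err)
	}

	// Admin addresses of the Caddy instances in the cluster
	// (placeholders; substitute your own hosts).
	instances := []string{
		"http://10.0.0.1:2019",
		"http://10.0.0.2:2019",
	}

	for _, addr := range instances {
		// POST /load replaces the instance's running config.
		resp, err := http.Post(addr+"/load", "application/json", bytes.NewReader(cfg))
		if err != nil {
			fmt.Println(addr, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(addr, "->", resp.Status)
	}
}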

Caddy does have some features for securing the admin API for remote configuration; see this PR’s description, which explains them.

Clustered Caddy instances don’t necessarily need to have the same config, since their shared storage is mainly used for storing certificates, and the certificates in storage don’t necessarily need to match what’s in the config.

@yroc92 has some experience syncing config between Caddy instances.

In that thread, he shows how to have Caddy fetch its own config at startup. You can also optionally have Caddy regularly fetch a config from a central source on a timer.
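
If I recall the shape correctly, the startup-fetch approach is a small bootstrap JSON config along these lines; verify the exact field names against the current docs, and treat the URL and interval as placeholders:

{
	"admin": {
		"config": {
			"load": {
				"module": "http",
				"url": "https://config.internal.example/caddy.json"
			},
			"load_delay": "5m"
		}
	}
}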

If you want to push a config, you can set up remote administration like Francis mentioned.

But yeah, as Francis said, to be in a cluster, they need only share storage, not necessarily config (though that’s OK to do too).

We could probably look into having Caddy instances share their config from storage… that means every Caddy instance in the cluster would have 100% exactly the same configs. Is that what you intend? If so, I can definitely expedite this feature for business sponsors.
