Scaling Caddy to hundreds of thousands of domains

1. Caddy version (caddy version):

v2.5.1 h1:bAWwslD1jNeCzDa+jDCNwb8M3UJ2tPa8UZFFzPVmGKs=

2. How I run Caddy:

a. System environment:

Ubuntu 22.04

b. Command:

caddy run 2> error.txt 1> access.txt

c. Service/unit/compose file:

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target

d. My complete Caddyfile or JSON config:

{
	email <redacted>
	acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
	on_demand_tls {
		ask http://localhost:5555/check
	}
}

https:// {
	tls {
		on_demand
	}
	root * /var/www/html
	file_server
	log {
		output stdout
	}
}

3. The problem I’m having:

First of all, I would like to mention that I really enjoy working with Caddy. The simplicity of the config, as well as the active community, makes it a blast to work with. I am currently evaluating whether Caddy is a viable solution for managing TLS for hundreds of thousands of domains, and to that end I would like to extend my knowledge of Caddy. Many questions have already been answered in previous posts, but for some of the following topics I could not find anything.

Consider the following use case: I would like to manage TLS for 500’000+ domains via Caddy. I also have an account with Let’s Encrypt that supports increased rate limits. Ideally, the certificates for the domains should be issued, as well as managed, on demand. My goal is to make the service as robust and performant as possible within my raised Let’s Encrypt rate limits.

  • Are there any best practices you recommend when managing that many domains? (Features such as the ask endpoint, which should be used to make the service more performant and robust.)
  • I know that Caddy has an internal rate limit, which is set to 10 certificates per minute. In previous posts, you have explained that this throttle exists to avoid flooding the CA with certificate requests. Since I have an account with increased rate limits, is there any way to change this internal limit to better fit my use case (apart from modifying the source code)? And beyond Let’s Encrypt’s rate limits and the internal ones, is there anything else throttling Caddy’s performance that can be improved via the config?
  • How exactly does Caddy handle a large number of concurrent requests for certificates? From the logs, I can see that locks are acquired whenever Caddy must request a new certificate. It would be interesting to know, for example, how many threads are available to take incoming requests and what the overall performance is when handling that many requests simultaneously. (I am currently also testing the performance myself to see how it scales to a large number of requests, but it would be great if someone smarter than me could explain how the internals handle that volume of requests.)
  • With on-demand TLS active, certificates are issued and maintained whenever the need arises. Am I correct in assuming that with on-demand TLS, the renewal process is never triggered by anything other than an incoming handshake? This is important to know, because I would like to avoid something triggering the renewal of 100’000 certificates simultaneously in the background.

I would really appreciate it if someone could help me with these questions. Thanks in advance!

4. Error messages and/or full log output:

5. What I already tried:

6. Links to relevant resources:

It’s actually 10 per 10 seconds right now (effectively 60 per minute) as of this commit.

If you absolutely need to go faster than that, then you’ll need to compile a build of Caddy with this rate limit adjusted. We haven’t made that rate limit configurable yet because it’s usually enough for everyone. You might want to give your opinion on this issue Make internal rate limiting configurable · Issue #143 · caddyserver/certmagic · GitHub if you think it should be made configurable (and your point that you have an account with increased rate limits is a good one).
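
For what it’s worth, one way to build with an adjusted throttle without hand-editing Caddy itself is to fork certmagic, change the limit in your fork, and point xcaddy at it with a module replacement (the fork path and branch below are placeholders):

xcaddy build --with github.com/caddyserver/certmagic=github.com/yourname/certmagic@your-branch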

At that rate of 10 per 10 seconds (one cert per second on average), issuing 500,000 certs would take about 500,000 seconds, or roughly 139 hours, at a minimum. How long it takes in practice depends on how active each of your domains is, since with On-Demand TLS a certificate is only obtained upon the first TLS handshake for a domain. That should probably be fast enough for you to do a gradual rollout if you switch over the domains in chunks over the span of a couple weeks.

You might want to adjust the certificate cache capacity (currently only configurable via the JSON config; see the JSON Config Structure reference in the Caddy documentation), which defaults to 10,000 certificate entries; this is the number of certs Caddy will keep in memory at any given time. The cap keeps Caddy from consuming far too much RAM, but if your server has RAM to spare, raising it helps performance: it avoids the file I/O needed to reload certs from storage back into the cache, and avoids cache thrashing.
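
For reference, that knob lives in the JSON config roughly like this (50,000 is an arbitrary example value; size it to your available RAM):

{
	"apps": {
		"tls": {
			"cache": {
				"capacity": 50000
			}
		}
	}
}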

Go uses a goroutine-based concurrency model: it will use as many CPU threads as are available, but you can have hundreds of thousands of goroutines doing concurrent processing.

When using On-Demand TLS, issuance happens synchronously in the same goroutine as the request. There’s no limit to concurrency by Caddy itself, but the kernel might limit you in certain ways that you could tweak (I’m not an expert there). You can do some research on optimizing Go programs for high concurrency.

Correct. Renewal will happen during a handshake if the certificate is within the last 1/3 of its lifetime (the last 30 days of a 90-day Let’s Encrypt certificate).

I’ll defer to @matt to clarify other points in case I got anything wrong.

I strongly suggest that you consider sponsoring Caddy, especially if you’re using it for such a large deployment, so you can get prioritized support and ensure the future of the project.

Welcome to the Caddy community, Tim! Great questions.

It absolutely is – it’s been specially designed for exactly that use case. Several companies are doing this already.

I second what Francis wrote, and will clarify a few things, and add my own thoughts.

Sometimes. The only times this happens are first-time cert issuance (i.e. no certificate exists to satisfy the handshake) or when the certificate has completely expired (i.e. no usable certificate exists to satisfy the handshake). Renewals still happen in the background (but are triggered by handshakes, as opposed to a timer).

Definitely use the ask endpoint. Make sure your machines are powerful enough to decode that many certificates and have enough memory. If you are running multiple Caddy instances, have them share the same storage backend so they’ll share cert resources and coordinate management automatically.
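
To illustrate the ask endpoint, here is a minimal sketch in Go matching the http://localhost:5555/check address from the config above; the in-memory allowlist is a stand-in for whatever database actually holds your provisioned domains:

package main

import (
	"log"
	"net/http"
)

// allowed stands in for your real datastore of provisioned domains;
// in production this would be a fast database or cache lookup.
var allowed = map[string]bool{
	"example.com": true,
	"example.net": true,
}

func main() {
	// Before each on-demand issuance, Caddy requests this endpoint
	// with the hostname in the "domain" query parameter; any 2xx
	// response permits issuance, anything else denies it.
	http.HandleFunc("/check", func(w http.ResponseWriter, r *http.Request) {
		if allowed[r.URL.Query().Get("domain")] {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "unknown domain", http.StatusForbidden)
	})
	log.Fatal(http.ListenAndServe("localhost:5555", nil))
}

Since this endpoint is consulted on every issuance decision, keeping it fast (and close to Caddy) matters nearly as much as keeping it correct.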

Francis has covered this pretty well: the internal throttle is actually higher now. But let me know how we can better accommodate your use case if it’s not sufficient already.

Like Francis said, we defer you to Go’s concurrency model (and its memory model) and the tuning of your Linux kernel. (You may also find Effective Go handy.) Your server will likely be at its busiest handling normal TLS connections and HTTP requests though, not managing certificates (especially after the initial round of certs has been obtained).

In general, Go is widely accepted as being able to support millions of goroutines, as long as memory constraints permit. (One goroutine != one system thread. Many goroutines are multiplexed onto each system thread. The Go scheduler is quite smart because it understands Go code!)
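
A toy demonstration of that model (plain Go, nothing Caddy-specific):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// GOMAXPROCS(0) reports how many OS threads may execute Go code
	// at once; by default it equals the number of CPU cores.
	fmt.Println("OS threads available:", runtime.GOMAXPROCS(0))

	// Starting 100,000 goroutines is cheap (each begins with a stack
	// of just a few KB); the Go scheduler multiplexes all of them
	// onto that handful of OS threads.
	var wg sync.WaitGroup
	for i := 0; i < 100_000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// imagine one TLS handshake or HTTP request handled here
		}()
	}
	wg.Wait()
	fmt.Println("all 100,000 goroutines completed")
}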

It’s been a while since I’ve looked at that part of the code, but I believe so, yes.

100,000 handshakes at the same time all indicating different hostnames would certainly do it. But even then, as has been explained, Caddy has internal throttling to prevent stampeding herds.

💯 this – I encourage all our large adopters to get a sponsorship to ensure that Caddy runs their critical operations smoothly, and to ensure ongoing maintenance of the project, etc. Sponsorships are customizable based on your needs, too. Let’s talk about that and help you get set up with your upgraded infrastructure!

Thanks Matt and Francis, you have been an incredible help.

I am trying to design a redirect service where you can redirect domains that do not have TLS. The idea behind my current approach is that I issue a cert via a cluster of on-demand Caddys as soon as someone creates a new redirect. Because I want to avoid potential delays at handshakes, I thought about managing the renewal of the certs on a cluster of “pre-issued” Caddys. Additionally, these pre-issued Caddys would also handle the redirect.
Now ideally, the pre-issued Caddys share one config and sit behind a load balancer. This way, I can ensure that if a Caddy goes offline, the others can still manage certs and handle redirects.

Caddy already offers great support for sharing resources amongst different instances: you simply enter the storage path into the config, and you’re set. However, I could not find anything on how to achieve a shared config amongst different Caddy instances. You can run the Caddys off of the same config initially; however, if I want to add a new host and redirect target (via the API), I somehow need to propagate the change to all Caddy instances. So far, the best approach I have found is to designate a single Caddy instance to handle API calls for config changes, write the new config to a file on the shared storage, and tell all the other Caddys to reload with the new config. Alternatively, I could broadcast the API call to all the Caddy instances. However, both of these approaches seem rather convoluted and come with a lot of drawbacks. Are there best practices for handling a shared config amongst multiple instances where the config changes frequently?

Clustering Caddys does not seem to be an issue as long as the config does not change or the Caddys run on separate configs (but then clustering only really makes sense in an on-demand scenario).

I have also forwarded a request to look into a sponsorship 🙂

Just to clarify… do you actually mean redirect or do you mean reverse proxy? Those are different concepts.

A redirect is a kind of HTTP response that contains a Location header which tells the client to make a new request at a different URL.

A reverse proxy is when a server (i.e. Caddy) sends the request to be handled by some other server and passes the response back to the client.

We don’t currently have a mechanism for that. You’ll need to push the config changes to all your Caddy instances yourself.
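
As a rough sketch of what pushing it yourself can look like, assuming each instance exposes the default admin endpoint on port 2019 (the instance addresses are placeholders):

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// The new config to roll out to every instance.
	cfg, err := os.ReadFile("caddy.json")
	if err != nil {
		panic(err)
	}

	// Admin addresses of the Caddy instances in the cluster
	// (placeholders; substitute your own hosts).
	instances := []string{
		"http://10.0.0.1:2019",
		"http://10.0.0.2:2019",
	}

	for _, addr := range instances {
		// POST /load replaces the instance's running config.
		resp, err := http.Post(addr+"/load", "application/json", bytes.NewReader(cfg))
		if err != nil {
			fmt.Println(addr, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(addr, "->", resp.Status)
	}
}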

Caddy does have some features for securing the admin API for remote configuration; see this PR’s description, which explains them.

Clustered Caddy instances don’t necessarily need to have the same config, since their shared storage is mainly used for storing certificates, and the certificates in storage don’t necessarily need to match what’s in the config.

@yroc92 has some experience syncing config between Caddy instances.

In that thread, he shows how to have Caddy fetch its own config at startup. You can also optionally have Caddy regularly fetch a config from a central source on a timer.
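
If I recall the shape correctly, the startup-fetch approach is a small bootstrap JSON config along these lines; verify the exact field names against the current docs, and treat the URL and interval as placeholders:

{
	"admin": {
		"config": {
			"load": {
				"module": "http",
				"url": "https://config.internal.example/caddy.json"
			},
			"load_delay": "5m"
		}
	}
}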

If you want to push a config, you can set up remote administration like Francis mentioned.

But yeah, as Francis said, to be in a cluster, they need only share storage, not necessarily config (though that’s OK to do too).

We could probably look into having Caddy instances share their config from storage… that means every Caddy instance in the cluster would have 100% exactly the same configs. Is that what you intend? If so, I can definitely expedite this feature for business sponsors.
