Should I build a queue for admin API updates on my end, if there are a lot at once?

Hi there! I’m not looking for specific technical help; this is more a question about how well Caddy v2 can handle a bunch (say, 100) of config updates all arriving very close together.

I’m asking because I’m building an automated system for creating virtual hosts, and there may be instances where a lot of new virtual hosts are created at once through the admin API.

I’m wondering if I should implement a queueing system on my end right out of the gate, or if that would be a waste because Caddy already queues them up.

So far, my tests have mostly worked out, though I get a few errors like this one:

  job failed      {"error": "test417.fileshare.page: obtaining certificate: [test417.fileshare.page] Obtain: context canceled"}

I suspect these happen because I’m sending dozens of config updates in very quick succession.
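For context, each new virtual host currently gets its own admin API request, roughly like this (a simplified sketch of my setup; the server name “srv0” and the route body are illustrative, not my exact config):

    // Sketch: one admin API call per new hostname, fired in a tight loop.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    func addHost(host string) error {
        // Appends a minimal route for the new host to an existing server.
        route := fmt.Sprintf(`{"match":[{"host":[%q]}],"handle":[{"handler":"static_response","body":"hi"}]}`, host)
        resp, err := http.Post(
            "http://localhost:2019/config/apps/http/servers/srv0/routes",
            "application/json",
            bytes.NewBufferString(route),
        )
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("unexpected status: %s", resp.Status)
        }
        return nil
    }

    func main() {
        // Dozens of these in quick succession is what seems to trigger the errors.
        for i := 0; i < 100; i++ {
            if err := addHost(fmt.Sprintf("test%d.fileshare.page", i)); err != nil {
                fmt.Println("update failed:", err)
            }
        }
    }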

Thanks!


Config updates acquire a lock so that only one update happens at a time. The majority of the time spent in a config reload is gracefully cycling the HTTP servers: starting the new listeners (sometimes OSes take many milliseconds for that system call), waiting for in-flight requests to finish on the old listeners, and then finally closing out the old listeners. We have to block during this, because otherwise it would be impossible to know whether the reload was successful. And we can’t have multiple reloads happening at a time because, well, data races and other, worse things.
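Conceptually, the serialization looks something like this (just an illustrative sketch, not Caddy’s actual code; applyGracefully is a made-up stand-in for the listener cycling described above):

    package main

    import (
        "errors"
        "sync"
    )

    // One mutex serializes reloads; callers block until the outcome is
    // known so the admin API response can report success or failure.
    var configMu sync.Mutex

    // applyGracefully is a hypothetical stand-in for starting new
    // listeners, draining in-flight requests, and closing old listeners.
    func applyGracefully(newCfg []byte) error {
        if len(newCfg) == 0 {
            return errors.New("empty config")
        }
        return nil
    }

    func changeConfig(newCfg []byte) error {
        configMu.Lock() // only one reload at a time
        defer configMu.Unlock()
        return applyGracefully(newCfg) // block so we can report the result
    }

    func main() {
        _ = changeConfig([]byte(`{}`))
    }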

While Caddy itself can handle hundreds of config reloads per second (I tested this by disabling all the graceful features and network listeners, then hammering it really hard), it’s the OS that’s too slow to keep up. Those system calls are buggers.

To ease contention, you could configure less graceful reloads, for example by shortening the grace period: JSON Config Structure - Caddy Documentation
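For example, something like this sets a shorter grace period through the admin API (a sketch; the 1s value is just illustrative):

    // Sketch: shorten the HTTP app's grace period so old servers are
    // torn down sooner during reloads.
    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    func main() {
        // POST sets the value at this config path; grace_period is a
        // duration string.
        resp, err := http.Post(
            "http://localhost:2019/config/apps/http/grace_period",
            "application/json",
            strings.NewReader(`"1s"`),
        )
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(resp.Status)
    }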

However, what I’d recommend is batching your API updates instead. Like I said, Caddy can handle frequent config updates fine, but you have to cater to what your OS is capable of. So if you’re giving it 100 config updates very close together and your OS can’t keep up (or you can’t gracefully cycle servers fast enough), you can simply combine them into one update. What this looks like depends on what your API calls are; if you’re just adding hostnames, it’s really, really simple.
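For example, rather than 100 separate requests, you could append all the new hostnames to a host matcher in a single call, something like this (a sketch; it assumes a server named srv0 with a host matcher at routes/0/match/0/host, and a Caddy version whose API supports the /... path suffix for expanding an array body, so adjust for your actual config):

    // Sketch: batch 100 new hostnames into ONE config update instead of
    // 100 separate updates, so Caddy reloads only once.
    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    func main() {
        hosts := make([]string, 0, 100)
        for i := 0; i < 100; i++ {
            hosts = append(hosts, fmt.Sprintf("test%d.fileshare.page", i))
        }
        body, err := json.Marshal(hosts)
        if err != nil {
            panic(err)
        }

        // One POST, one reload. The trailing /... expands the array body
        // into individual elements appended to the host matcher.
        resp, err := http.Post(
            "http://localhost:2019/config/apps/http/servers/srv0/routes/0/match/0/host/...",
            "application/json",
            bytes.NewBuffer(body),
        )
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(resp.Status)
    }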

As you noticed, something else to consider is certificate operations. These can take anywhere from a few seconds to a few minutes. To avoid leaking resources and to avoid spamming CAs with transactions, we cancel them when the associated config is unloaded. We have to cancel them because there’s no way to know whether they’ll still be needed with the new config. We could wait and find out, but that requires yet another goroutine which – you guessed it – waits until the new config is fully loaded, then does some sort of diff of the two configs and applies only the delta. And we’d have to do that for potentially every single config parameter. That waiting goroutine is itself another resource, so it would still end up leaking resources under frequent reloads. :upside_down_face:

So, the only “queueing” that Caddy is doing is the natural blocking of HTTP requests on a mutex over the config value. I guess that certainly works as a queue, but if your config updates come in bursts of 100, I would batch them instead.

If you’re sending 100 concurrent config updates continuously (like, 100 every second over the lifetime of the server, rather than in bursts), then you should find a way to limit the rate at which you send them.

I think with more developer resources we can probably find ways to make config reloads even faster under pressure, but it’d be mostly heuristics and really specific optimizations for the most common use cases, like finding ways to correctly and safely reuse HTTP listeners. (Remember that to Caddy, a config is mostly a black box. It’s up to each individual module to provision itself and clean up after itself. And the HTTP server is one such module.)


Thank you, that’s a very helpful and thorough answer and makes perfect sense.

For my own use case, what I’ll do is implement a simple queue on my end, and use the tls on_demand feature to spread out the certificate operations (and because I won’t always be in control of the DNS for every domain). I might create a batching system later as well, though that’ll be a bit more work to coordinate.
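In case it helps anyone else, the queue I have in mind is just a buffered channel drained by a single worker, so updates hit the admin API strictly one at a time. A rough sketch (the config path and body are made up for illustration):

    // Sketch of a client-side queue: one worker serializes admin API calls.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    type update struct {
        path string // admin API config path
        body []byte // JSON to POST
    }

    func worker(updates <-chan update, done chan<- struct{}) {
        defer close(done)
        for u := range updates {
            resp, err := http.Post("http://localhost:2019"+u.path, "application/json", bytes.NewBuffer(u.body))
            if err != nil {
                fmt.Println("update failed:", err)
                continue
            }
            resp.Body.Close()
        }
    }

    func main() {
        updates := make(chan update, 100) // buffer absorbs bursts
        done := make(chan struct{})
        go worker(updates, done)

        // Producers just enqueue; the worker serializes the actual calls.
        updates <- update{
            path: "/config/apps/http/servers/srv0/routes", // illustrative path
            body: []byte(`{"match":[{"host":["example.com"]}],"handle":[]}`),
        }
        close(updates)
        <-done
    }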

FWIW, I’ve mentioned this to Matt before, but I’d like to see JSON-RPC-style batch request support for the admin API, so that you can send multiple config operations at once and Caddy can apply them all together, then do just one reload.

That would make batch config changes much simpler from a client standpoint, because then the client wouldn’t need to care so much about the current config state (i.e. it wouldn’t need to set up a lock, fetch the config from Caddy, modify it, then push it back and unlock).

So yeah, a JSON-RPC 2.0 endpoint in the Caddy API would be pretty slick :smile: I’m pretty opinionated, but RPC > REST, IMO.

https://www.jsonrpc.org/specification
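For the unfamiliar: per that spec, a batch is just an array of request objects, and the response is an array of results matched up by id. The method names below are made up (Caddy has nothing like this today), but a batch of config ops could look something like:

    [
      {"jsonrpc": "2.0", "method": "config.append",
       "params": {"path": "/apps/http/servers/srv0/routes/0/match/0/host",
                  "value": "a.example.com"}, "id": 1},
      {"jsonrpc": "2.0", "method": "config.append",
       "params": {"path": "/apps/http/servers/srv0/routes/0/match/0/host",
                  "value": "b.example.com"}, "id": 2}
    ]

Caddy could apply all of those in one go, do a single reload, and still report a per-operation result for each id.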


What if Caddy could optionally batch them implicitly? Like, wait a few seconds before applying the configs, and just accept 100 of the fast (in-memory) updates before actually running the reload. Or something like that. I’m on mobile atm, sorry if that doesn’t make sense.

Yeah, that would be a reasonable approach as well. It would require passing a header like X-Caddy-Delay-Reload: true, I guess, and the request would need to return right away, so you wouldn’t hear immediately whether it was successful.

Okay, at the computer for a sec, so I can elaborate.

What I was thinking is: optionally have Caddy update its internal representation of the current config during a burst, rather than applying the changes each time. The changes would only be applied at the end of the burst (maybe by a request with a header that forces Caddy to apply the batched changes and block until it’s done), and that final request would get the results of the reload (i.e. whether it succeeded, or the error if it failed). The final response would also include a value indicating how many requests were batched together, so that if there was an error, the client knows it applies to the changes from the last N requests.
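To make that concrete, the client side of the flow might look something like this (entirely hypothetical; neither of these headers exists today, and the config paths are illustrative):

    // Entirely hypothetical sketch; the X-Caddy-* headers do not exist.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    func post(path string, body []byte, header, value string) (*http.Response, error) {
        req, err := http.NewRequest("POST", "http://localhost:2019"+path, bytes.NewBuffer(body))
        if err != nil {
            return nil, err
        }
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set(header, value)
        return http.DefaultClient.Do(req)
    }

    func main() {
        // Burst: each request only mutates Caddy's in-memory config copy.
        for _, host := range []string{"a.example.com", "b.example.com"} {
            body := []byte(`{"match":[{"host":["` + host + `"]}],"handle":[]}`)
            resp, err := post("/config/apps/http/servers/srv0/routes", body,
                "X-Caddy-Delay-Reload", "true") // hypothetical header
            if err == nil {
                resp.Body.Close()
            }
        }

        // Final request: forces Caddy to apply everything batched so far,
        // blocks until the reload completes, and reports the batch result
        // (again, hypothetical).
        body := []byte(`{"match":[{"host":["c.example.com"]}],"handle":[]}`)
        resp, err := post("/config/apps/http/servers/srv0/routes", body,
            "X-Caddy-Apply-Batch", "true")
        if err != nil {
            fmt.Println("batch apply failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("batch result:", resp.Status)
    }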

Or we could have a mode where Caddy implicitly queues config changes while a previous reload is happening, then applies them once it is finished. This is asynchronous, however, so you wouldn’t get the success or error results in your HTTP response.

So, those are a few ideas that I don’t think would be terribly hard to implement.

But I do like the simplicity we have right now, where everything is synchronous: you get to control how hard you hit the API, and you know every error and the result of every request.

This advantage would be kept with a JSON-RPC approach to batching, FWIW.
