Many "unable to obtain ARI lock" errors and connection timeouts

1. The problem I’m having:

Every now and then when I look through the logs I see lots of errors that say “unable to obtain ARI lock: context deadline exceeded”.

Occasionally the server goes down with connection timeouts (it comes back up within a minute or so), and when that happens I usually see a massive number of these errors in the logs. I’m not sure whether that’s just coincidence, though, because the server can also respond just fine while these errors are being logged.

Within the space of 5 minutes I can see 1,200 of these error logs spread across 19 different domains. One domain takes up roughly 900 of those errors. We have something in the range of 150 domains all using on-demand TLS.

Is there anything I can or should do about this? Could it be what’s causing the server to stop responding with connection timeouts, or is that just a coincidence?

2. Error messages and/or full log output:

{"level":"error","ts":"2025-10-07T04:18:51.2198329+01:00","logger":"tls.on_demand","msg":"updating ARI","identifiers":["DOMAIN_HERE"],"server_name":"DOMAIN_HERE","error":"unable to obtain ARI lock: context deadline exceeded"}

3. Caddy version:

v2.10.2 h1:g/gTYjGMD0dec+UgMw8SnfmJ3I9+M2TdvoRL/Ovu6U8=

4. How I installed and ran Caddy:

Unzipped the exe from the GitHub releases and ran it with:
caddy.exe run --config C:\path\to\Caddyfile

a. System environment:

Windows Server 2022 Standard 21H2; Caddy just runs as a service with the command above

b. Command:

caddy.exe run --config C:\path\to\Caddyfile

c. Service/unit/compose file:

N/A

d. My complete Caddy config:

{
	log {
		format json {
			time_local
			time_format rfc3339_nano
			duration_format string
		}
	}

	email EMAIL_ADDRESS_HERE

	on_demand_tls {
		ask http://localhost:7071/api/v1/meta/caddy/ask
	}

	storage file_system {
		root "C:/server/caddy/.caddy"
	}
}

import "sites/*"

Each site looks like this:

DOMAIN_HERE {
	tls {
		on_demand
	}

	encode zstd gzip

	reverse_proxy local.hyperv.vm1:APPLICATION_PORT
}

5. Links to relevant resources:

N/A

Interesting – if you make a connection, do you observe that it completes successfully, and if so, does this error appear for that connection?

If this error is appearing during TLS handshakes, and the clients disconnect almost immediately, I could see this error appearing.

It’s pretty difficult to get a reliable test out of this, since I can’t just trigger it whenever I like.

As far as I can tell from the tests I have managed to do, requesting a domain that’s actively outputting these errors does complete successfully in the majority of cases. Usually there are between 5 and 20 or so errors before it stops, so my assumption is that it resolves itself fairly quickly and doesn’t take up too much time.

If I look at the history of outages that tend to last under a minute, I do consistently see massive numbers of these errors: usually hundreds to thousands within the space of a minute or two for some domains trying to get an ARI lock. Whenever I’ve checked during those times, I’ve always gotten an immediate disconnect; no long waits or anything like that.
But again, I can’t definitively tie the outages to this error; it’s just what I happen to notice in the logs.

It’s difficult to give much concrete information on this since the error is just a context deadline being exceeded, and I can’t just trigger the error myself whenever I like.

Is there anything I could do to create a more meaningful test, or get more useful output that might shed some light on the source of the problem?
I’m fine with patching Caddy and building from source to try anything out if needed.

EDIT:

I’ve just had a thought that it could be Windows Defender scanning the lock files and causing the context deadline to be exceeded.

I’ve added the caddy folders to the exclusion list for Windows Defender.
I seem to be getting far fewer “unable to obtain ARI lock” errors now (1 or 2 instead of hundreds…), which is encouraging.

The main errors now are the occasional “unable to release ARI lock”, which is just os.Remove() failing because the file is still in use (“The process cannot access the file because it is being used by another process.”).
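
That one looks like a textbook transient Windows error, so a short retry loop with a small backoff would probably paper over it. A minimal sketch of the idea (hypothetical; not something certmagic does, as far as I know):

package lockutil

import (
	"errors"
	"os"
	"syscall"
	"time"
)

// ERROR_SHARING_VIOLATION (32) is what Windows reports when another
// process (e.g. an AV scanner) still holds a handle to the file.
const errSharingViolation = syscall.Errno(32)

// removeWithRetry retries os.Remove for a short window when the only
// problem is a transient sharing violation; any other error fails fast.
func removeWithRetry(path string, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = os.Remove(path); err == nil {
			return nil
		}
		var errno syscall.Errno
		if !errors.As(err, &errno) || errno != errSharingViolation {
			return err
		}
		time.Sleep(backoff)
	}
	return err
}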

I’ll keep an eye on things, but hopefully Windows Defender was the source of the context deadline issue…


Well, good detective work – and thanks for experimenting to troubleshoot!

My suspicion is that during a flood of short-lived connections, the request context gets canceled, which aborts the wait for the ARI update lock.

This might all be normal and expected under those circumstances. Only one needs to succeed to update the ARI info, and the contexts are being canceled presumably because the clients are disconnecting anyway, so they have no need for the ARI update.
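
In rough shape, the wait looks something like this (an illustrative sketch, not the exact certmagic code):

package main

import (
	"context"
	"fmt"
	"time"
)

// waitForLock shows how a lock wait tied to the request's context can
// surface "context deadline exceeded": if the client goes away or the
// deadline passes before the lock frees up, ctx.Done() wins the select.
func waitForLock(ctx context.Context, acquired <-chan struct{}) error {
	select {
	case <-acquired:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("unable to obtain ARI lock: %w", ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	// The lock never becomes available here, so the wait fails the same
	// way a disconnecting client's canceled context would.
	fmt.Println(waitForLock(ctx, make(chan struct{})))
}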

Keep me posted if you discover anything else.

So I’ve been keeping an eye on the logs, and although excluding the caddy folders from Windows Defender has helped a bit, I’m still seeing a buildup of ARI lock “context deadline exceeded” errors; in some cases it still reaches 10k+ errors within an hour or two.

Server outages (connection timeouts) didn’t happen all the time even before the AV exclusion, so I’ll just have to wait and see whether they happen again. Even without connection timeouts, though, this still seems excessive, to the point that I’d still like to find a fix.

I rebuilt Caddy with some extra log output to see how long acquireLock was taking, and every time it blocks for 8 minutes before failing and returning an error. So we end up with thousands upon thousands of goroutines taking up system resources and hammering the disk for 8 minutes, only to die having achieved nothing.
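
The extra logging was essentially just a timing wrapper around the lock call, along these lines (a simplified sketch of my debug patch; certmagic’s real internals differ):

package lockdebug

import (
	"context"
	"log"
	"time"
)

// timedLock wraps any blocking lock call and logs how long it blocked
// before succeeding or failing; lockFn stands in for certmagic's
// acquire path here.
func timedLock(ctx context.Context, key string, lockFn func(context.Context, string) error) error {
	start := time.Now()
	err := lockFn(ctx, key)
	log.Printf("lock %q: blocked for %s, err=%v", key, time.Since(start), err)
	return err
}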

I’m going to dig into the certmagic code and see if I can figure something out, but I have been wondering: why do the locks use the file system?
Does Caddy need to sync between separate instances or something like that, or was it just a convenient mechanism you already had access to when you wrote the code?
If the locks have to be done with the file system then I’ll stick to changes that continue with that strategy.

Yeah, the locks ensure that operations are synchronized across multiple instances of Caddy – and within the same instance of Caddy as well.

What file system is it? Ext4?

Would be happy to have your help investigating then.

It’s a Windows Server 2022 machine so it’s using NTFS for the file system.

Although it doesn’t address the underlying issue, I’ve implemented a TryLock on the Locker interface to see how that goes. Seeing as ARI isn’t mandatory, and all call sites just log and carry on when there’s an error anyway, it seems like a good fit for a TryLock to me.
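
The rough shape of the idea is a single non-blocking acquisition attempt, something like this (a simplified sketch, not the exact code I’m running):

package lockutil

import (
	"errors"
	"io/fs"
	"os"
)

// tryLockFile makes one non-blocking attempt to take a file lock.
// O_CREATE|O_EXCL is atomic, so exactly one caller (in any process)
// wins; a loser gets (false, nil) and can simply skip the optional
// work instead of queueing up behind the lock.
func tryLockFile(path string) (bool, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		if errors.Is(err, fs.ErrExist) {
			return false, nil // someone else holds the lock
		}
		return false, err
	}
	return true, f.Close()
}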

I’ll keep this patch running over the next week or two, but would you be open to a TryLock patch in certmagic if this fixes our issues?

Also, is there anything you’d like me to change and test whilst I’m doing these tests?


I’d be open to it, although I have reservations. Will know better once I see the code, and hear about the results. Thanks for collaborating!

That’s good to hear.

I’ll keep an eye on things and get what data I can from the logs this week.
If everything stays as stable as it has been since I made the changes, I’ll open an issue and/or a PR next week to make a case for the changes and discuss further.


After keeping an eye on the logs and the other monitoring tools we have for the servers, I feel that the changes I’ve made have kept everything stable for us.

The only changes I’ve made are in certmagic, but of course if you were open to merging those changes in then the version of certmagic used in Caddy would also need to be updated.

Should I just start with an issue/PR in certmagic sometime next week and go from there?


Yes! That would be great, thank you!

I’ve made an initial certmagic PR here: Implement `TryLock` for use with optional tasks like ARI updates to reduce lock contention by polyscone · Pull Request #357 · caddyserver/certmagic · GitHub

I’ll continue any discussion over there when you or any other maintainers have the time.

