Error "TLS alert, internal error (592)" (again)

Ok, I have a theory. This is a fun one, so try to keep up:

The in-memory cert cache fills up (say the maximum is 10,000), but an on-demand cert is needed for trail.pixelpenguinmedia.live. The cert for img.lemlist.com is chosen at random to be evicted, and is replaced by the cert for trail.pixelpenguinmedia.live, which was already in storage.

Immediately after the certs are swapped in the cache (which happens atomically as a single critical section), maintenance occurs on the new cert because the cert for trail.pixelpenguinmedia.live is expired and must be renewed. However, the “ask” endpoint returns a non-200 status (404), so Caddy is not allowed to obtain/renew that certificate. Thus, the cert is removed from the cache as quickly as it was loaded.

So a cache that had 10,000 certs in it now has 9,999, because the 10,001st cert (trail.pixelpenguinmedia.live) replaced img.lemlist.com only for a moment. We removed one cert, added a different one, and then removed that one too, without restoring the cert we evicted to make room for it.
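To make the sequence concrete, here is a toy simulation in Go. It is only a sketch of the theory above, not CertMagic's actual cache code: `toyCache`, `simulate`, and the filler hostnames are made up for illustration, and the eviction victim is passed in explicitly (the real cache picks one at random) so the outcome is reproducible.

```go
package main

import "fmt"

// toyCache is a hypothetical stand-in for the in-memory cert cache.
type toyCache struct {
	capacity int
	certs    map[string]struct{}
}

// add inserts name. If the cache is full, it first evicts victim.
// The real cache chooses the victim at random; here it is explicit.
func (c *toyCache) add(name, victim string) {
	if len(c.certs) >= c.capacity {
		delete(c.certs, victim)
	}
	c.certs[name] = struct{}{}
}

// remove drops name from the cache.
func (c *toyCache) remove(name string) {
	delete(c.certs, name)
}

// simulate replays the scenario and reports the final cache size and
// whether img.lemlist.com is still cached.
func simulate(capacity int) (size int, imgCached bool) {
	c := &toyCache{capacity: capacity, certs: make(map[string]struct{})}

	// Start with a full cache: img.lemlist.com plus filler certs.
	c.certs["img.lemlist.com"] = struct{}{}
	for i := 1; i < capacity; i++ {
		c.certs[fmt.Sprintf("filler-%d.example", i)] = struct{}{}
	}

	// Step 1: the on-demand cert is loaded from storage, and
	// img.lemlist.com draws the short straw.
	c.add("trail.pixelpenguinmedia.live", "img.lemlist.com")

	// Step 2: maintenance finds the new cert expired, but the "ask"
	// endpoint returned 404, so renewal is denied and the cert is removed.
	askAllowed := false
	if !askAllowed {
		c.remove("trail.pixelpenguinmedia.live")
	}

	_, imgCached = c.certs["img.lemlist.com"]
	return len(c.certs), imgCached
}

func main() {
	size, imgCached := simulate(10000)
	fmt.Println(size, imgCached) // prints "9999 false"
}
```

The cache ends up one short of full, and the cert that is actually wanted is the one that is missing.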

Most unfortunately, my patch was not 100% useful in this (IMO) unusual edge case: for certificates that are not configured for on-demand (like img.lemlist.com), I restricted loading from storage to times when the cache was completely full:

…which is sound logic, because a cert that is not managed “on-demand” should be loaded only at startup, and the only reason it would disappear from the cache is if the cache was full, to make room for another cert, should it draw the short straw.

…EXCEPT when a cert is removed for a reason other than the cache being full (for example, a non-200 response from the “ask” endpoint).

:man_facepalming:

Does that make sense?

I think the easiest fix is for me to remove that “cache full” condition and just allow any cert to be loaded from storage at any time, regardless of whether on-demand is enabled or the cache is full. This might incur a slight performance hit for servers that are getting a lot of spammy handshakes, especially if their storage is slow. But I think it would only slow down handshakes that are asking for certs that don’t exist or aren’t allowed; it shouldn’t slow down handshakes for recognized, managed hostnames. I think. Right?
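Roughly, the before and after of that condition look like this. Again, this is a hedged sketch in Go, not the real patch: `oldMayLoad` and `newMayLoad` are hypothetical names, and the real check lives inside the handshake path with more context than two arguments.

```go
package main

import "fmt"

// oldMayLoad models the rule from my earlier patch (hypothetical name):
// an on-demand cert may always be loaded from storage, but any other
// cert may be loaded only while the cache is completely full.
func oldMayLoad(onDemand bool, cached, capacity int) bool {
	return onDemand || cached >= capacity
}

// newMayLoad models the proposed fix (hypothetical name): any cert may
// be loaded from storage at any time.
func newMayLoad(onDemand bool, cached, capacity int) bool {
	return true
}

func main() {
	// img.lemlist.com is not on-demand, and the cache holds 9,999 of 10,000.
	fmt.Println(oldMayLoad(false, 9999, 10000)) // false: the cert stays missing
	fmt.Println(newMayLoad(false, 9999, 10000)) // true: the cert can be reloaded

	// The lucky moment: the cache is briefly 100% full again.
	fmt.Println(oldMayLoad(false, 10000, 10000)) // true
}
```

The first line is the bug: with the cache at 9,999, the old rule never lets img.lemlist.com back in.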

(In this particular chunk of logs, it appears you were as lucky as you were unlucky: img.lemlist.com got back into the cert cache because a lot of requests for trail.pixelpenguinmedia.live arrived at the same time, so for a brief moment the cache was 100% full, the condition was satisfied, and the cert was re-loaded. Hence the “only” 6 second period where your alert triggered.)
