Restricting times of day Caddy can attempt certificate renewals

1. The problem I’m having:

I’m evaluating Caddy, but one requirement I have is that certificate renewals only happen between certain times of day, to avoid any potential issues when taking filesystem snapshots. Failing that, I’d at least like a way to trigger renewal checks manually at runtime (so I can schedule them via cron).

2. Error messages and/or full log output:

N/A

3. Caddy version:

2.8.4

4. How I installed and ran Caddy:

Not installed yet

a. System environment:

Linux

b. Command:

N/A

c. Service/unit/compose file:

N/A

d. My complete Caddy config:

N/A

5. Links to relevant resources:

N/A

I suppose an alternative question is: are the certificates stored on disk replaced atomically (e.g. renamed into place rather than copied)? If so, then I don’t need to worry about the time of day or manually triggering renewals.

Interesting… can’t say we’ve had this request before.

So there aren’t any built-in configurable timeframes. What you could do, I guess, is configure the renewal scan interval in the tls app’s automation config: JSON Config Structure - Caddy Documentation

Then start the server at a time that aligns with when you’re OK with doing renewals, so that the renewal timer doesn’t tick during the snapshot window.
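
For example, a long scan interval might look something like this (just a sketch - example.com is a placeholder, and double-check the current JSON docs for the exact field name):

```json
{
  "apps": {
    "tls": {
      "automation": {
        "renew_interval": "24h",
        "policies": [
          { "subjects": ["example.com"] }
        ]
      }
    }
  }
}
```

With a 24h interval the scan only fires roughly once a day, offset from whenever the process started, so starting Caddy outside your snapshot window keeps the timer away from it.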

They are overwritten in place when they are replaced. The files may be read at any time, but Caddy will only write them once, when they’re renewed.


So I take that to mean the certificate (and any other state I guess) is not atomically replaced? My problem is that ZFS snapshots happen at the block level so it’s possible to end up with a snapshot that contains a partially written file, which would cause problems if such a snapshot were restored and Caddy were (re)started.

Perhaps some workaround for this already exists, like having Caddy load the certificate from one path but save renewed certificates to a different path, with a post-renewal hook that atomically moves the renewed certificate to the real path (via a hard link or rename syscall)?

Atomic from what perspective? From Caddy’s perspective it’s atomic because cert management is synchronized. Only one writer updates the cert at a time. If that’s not what you mean then I think you’ll have to be more specific.

Does my suggested workaround not work?


From the perspective of the filesystem (ZFS in this case).

Oh I wouldn’t know. The default storage module (FileSystem) does not call ZFS-specific APIs, so in that sense, no it wouldn’t be “atomic”. But I really don’t know how it works or what is needed to make it atomic in that sense.


As ZFS is a copy-on-write filesystem, a file being replaced is written to new blocks and then the pointer is updated to the new file once it’s finished.

This means that if you snapshot while an existing file is in the middle of being rewritten, you should expect that the old version of the file will be represented in the snapshot rather than half of the old one and half of the new one.

ZFS snapshots are taken at the block level, but they happen on the blocks currently indicated by the file system.


Even with that, the same problem exists at a higher level, since as far as I understand it, Caddy replaces/updates other files besides the certificate as part of the renewal process, including the private key. All of those files would need to be updated atomically as a group for things to work properly.

So while a snapshot may not contain a partial file, it may contain a new certificate but an old private key for example, something that I would like to avoid.

It is true that you could end up with some files updated and not others, but I have to question whether that’s really a problem.

The philosophy of Caddy’s approach to certificate management is that certs are not necessarily something that should be backed up, or relied on to have been backed up. When you start Caddy, it will maintain its certificate store, and as part of that it’ll requisition new certificates as required. The goal of this approach is to treat certificates as short-lived and as disposable as possible.

Also - these renewals happen once every 60 days or so apiece if everything goes as planned (Caddy attempts renewal once a certificate has fewer than 30 days of validity remaining, if I’m not mistaken). What exactly is the scope of your managed certificates, and how frequent is your snapshot cadence, that you’re worried you’ll intersect the middle of certificate maintenance for a given certificate?

I’d argue that if you have a very fast snapshot cadence (minutes, or even hours), you’ll have relevant snapshots of all the required files on the odd chance that you do intersect cert maintenance. If you have a moderate or relaxed snapshot cadence, the chances of this occurring are so minuscule as to not be worth engineering a solution for. The window we’re talking about must surely be sub-second - it’s just writing a few small files. And again, on the odd chance you do hit it, you can revert the certificates to the older consistent snapshot and simply have Caddy renew them all again automatically for you, so even then the impact of this very unlikely problem is minimal.

I’m open to learning of a use case that changes my understanding on that, but I’m finding it hard to imagine one.


On top of all the above, I’m not really sure how Caddy could reliably achieve simultaneous atomicity of multiple files in a filesystem.

A write-and-rename is the usual method of achieving an atomic update of an existing file, but write-and-rename isn’t atomic for multiple files on any filesystem I can think of. ZFS technically does this behind the scenes already as described above, as a COW filesystem - even when you write “directly” to the file - but obviously still isn’t atomic for multiple writes.
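
To illustrate what I mean by write-and-rename, here’s a generic sketch in Go (not Caddy’s actual storage code): the new contents go to a temporary file in the same directory, which is then renamed over the original.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// writeAtomic replaces path with data using the write-and-rename pattern:
// write a temp file in the same directory, flush it, then rename it over
// the original. The rename is atomic for this one file only.
func writeAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleanup; harmless if the rename already moved it

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // make sure the bytes hit disk before the swap
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path) // atomic replacement of the single file
}

func main() {
	// Writing a cert and its key this way is two separate atomic renames,
	// not one atomic transaction covering both files.
	if err := writeAtomic("example.crt", []byte("new cert")); err != nil {
		log.Fatal(err)
	}
	if err := writeAtomic("example.key", []byte("new key")); err != nil {
		log.Fatal(err)
	}
}
```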

No, the more I think about it, the more I think that if this is a significant issue you need to not use the filesystem for certificate storage at all.

You should probably use one of the other storage providers and have that alternate provider (e.g. redis dump.rdb, postgres pg_dump, etc) deposit its data on the filesystem in a single transaction, as part of your snapshot script, to ensure full certificate store consistency.
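
For example (purely hypothetical - the module name and fields below are guesses at what a typical third-party Redis storage plugin accepts, so check the actual plugin’s docs), the top-level storage config would replace the default file_system module with something along these lines:

```json
{
  "storage": {
    "module": "redis",
    "address": "localhost:6379"
  }
}
```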


Generally speaking, the way I’ve solved that kind of problem in the past is by swapping “old” and “new” directories. That gives you a simple transaction-like setup on the filesystem.
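
Roughly this idea, sketched in Go (generic and not Caddy-specific - the directory names are made up): build the new state in a fresh directory, then atomically repoint a symlink at it, so readers only ever see a complete old copy or a complete new copy.

```go
package main

import (
	"log"
	"os"
)

// swap repoints the "current" symlink at newDir in one atomic step.
// Readers following "current" see either the entire old directory or the
// entire new one, never a mix of the two.
func swap(current, newDir string) error {
	tmpLink := current + ".tmp"
	os.Remove(tmpLink) // discard any leftover link from a failed previous run
	if err := os.Symlink(newDir, tmpLink); err != nil {
		return err
	}
	// rename(2) atomically replaces an existing symlink at the destination.
	return os.Rename(tmpLink, current)
}

func main() {
	// Hypothetical layout: ./state-new was just fully written out,
	// and ./state-current is the symlink everything reads from.
	if err := swap("state-current", "state-new"); err != nil {
		log.Fatal(err)
	}
}
```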

I’ve actually looked into that already, and the only relevant possibility would be the MySQL storage adapter (everything else would require installing additional software just to solve this one little problem). However, to be honest, after briefly looking at its code, I’m not sure I would trust it. I know zero Go, so it’s not feasible for me to write my own storage adapter (which I would otherwise gladly do in a situation like this).


I really don’t think you need to worry about this. Even if you did lose your certs/keys, just start Caddy up again and it’ll issue new ones. It’s not a problem. The only concern is Let’s Encrypt or ZeroSSL rate limits, but you’re probably well within those limits (they’re quite generous).


After some more thinking on this, I think I will just (gracefully) shut down Caddy and replace it with a simple “maintenance page” server while the snapshot is happening (since Caddy is only reverse proxying to a single server on localhost). As far as I know that should ensure everything on disk is always in a consistent state.

I dunno, that seems like a weird level of paranoia about all this. You’d really rather have downtime? I still don’t understand what actual risk you think there is here. Like I said, even if your certs were lost and the backups happened to be corrupted (which I give like a 0.0000001% chance of both happening at the same time), you can just start Caddy back up with empty storage to reissue all the certs for free, with no harm done. There’s really no reason for any downtime.


Yes, I’d rather have very minimal “downtime” every night than any additional, unnecessary risk of ever having to manually intervene in case of a snapshot rollback. Neither solution is ideal; I’m just choosing what I see as the better of the two options.

I think the ideal solution would be to have a way to disable Caddy’s automatic renewal checking and be able to manually and externally trigger the check and wait for it to finish (e.g. via Caddy’s admin API). So until such a feature exists, I will have to settle for the solution I’ve chosen.

I appreciate everyone’s input on the matter.

I don’t see why you’d need manual intervention. Issuance is fully automated.


I was referring to the situation when rolling back to a snapshot with inconsistent state/certificate files. Besides detecting the situation, I would need to (as you said) manually empty storage and restart Caddy.

On second thought, this might be something the events subsystem can help with. It looks like there are events emitted by certmagic that are passed on by Caddy before and after a renewal is attempted, so I could just have an event listener for each that executes a command in the foreground.

This would allow me to block on the cert_obtaining event until snapshotting is finished, if one is in progress. It would also allow me to touch a file that gets removed on cert_obtained/cert_failed, which can be checked for and waited on before starting a snapshot.
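
Something roughly like this is what I have in mind (a sketch only - it assumes a third-party exec event handler such as caddy-events-exec, and both the handler’s fields and my script paths are placeholders): the first script would wait out any in-progress snapshot and touch the “renewal in progress” flag, and the second would remove it.

```json
{
  "apps": {
    "events": {
      "subscriptions": [
        {
          "events": ["cert_obtaining"],
          "handlers": [
            { "handler": "exec", "command": "/usr/local/bin/wait-for-snapshot.sh" }
          ]
        },
        {
          "events": ["cert_obtained", "cert_failed"],
          "handlers": [
            { "handler": "exec", "command": "/usr/local/bin/clear-renewal-flag.sh" }
          ]
        }
      ]
    }
  }
}
```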

Can anyone think of any caveats (besides events being marked as experimental) or problems offhand with doing something like that?

You’re really way over-engineering this. This is a problem that you think exists, but it’s so infinitesimally unlikely to occur in practice that it’s not worth worrying about.
