Restricting times of day Caddy can attempt certificate renewals

1. The problem I’m having:

I’m evaluating Caddy, but one requirement I have is that certificate renewals only happen between certain times of day, to avoid any potential issues when taking filesystem snapshots. Failing that, I’d at least like a way to trigger renewal checks manually at runtime (so I can schedule them via cron).

2. Error messages and/or full log output:

N/A

3. Caddy version:

2.8.4

4. How I installed and ran Caddy:

Not installed yet

a. System environment:

Linux

b. Command:

N/A

c. Service/unit/compose file:

N/A

d. My complete Caddy config:

N/A

5. Links to relevant resources:

N/A

I suppose an alternative question is: are the certificates stored on disk replaced atomically (e.g. renamed into place rather than copied)? If so, then I don’t need to worry about the time of day or manually triggering renewals.

Interesting… can’t say we’ve had this request before.

So there aren’t any built-in configurable timeframes. What you could do, I guess, is configure the renewal scan interval in the tls app’s automation config: JSON Config Structure - Caddy Documentation

Then start the server at a time that aligns with when you’re OK with doing renewals, so that the renewal timer doesn’t tick during the snapshot window.
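
For example, a long scan interval might look something like this (just a sketch - example.com is a placeholder, and double-check the current JSON docs for the exact field name):

```json
{
  "apps": {
    "tls": {
      "automation": {
        "renew_interval": "24h",
        "policies": [
          { "subjects": ["example.com"] }
        ]
      }
    }
  }
}
```

With a 24h interval the scan only fires roughly once a day, offset from whenever the process started, so starting Caddy outside your snapshot window keeps the timer away from it.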

They are overwritten in place when they are replaced. The files may be read at any time, but Caddy will only write them once, when they’re renewed.


So I take that to mean the certificate (and any other state I guess) is not atomically replaced? My problem is that ZFS snapshots happen at the block level so it’s possible to end up with a snapshot that contains a partially written file, which would cause problems if such a snapshot were restored and Caddy were (re)started.

Perhaps some workaround for this already exists, like having Caddy load the certificate from one path but save renewed certificates to a different path, with a post-renewal hook that atomically moves the renewed certificate to the real path (via a hard link or rename syscall)?

Atomic from what perspective? From Caddy’s perspective it’s atomic because cert management is synchronized. Only one writer updates the cert at a time. If that’s not what you mean then I think you’ll have to be more specific.

Does my suggested workaround not work?


From the perspective of the filesystem (ZFS in this case).

Oh I wouldn’t know. The default storage module (FileSystem) does not call ZFS-specific APIs, so in that sense, no it wouldn’t be “atomic”. But I really don’t know how it works or what is needed to make it atomic in that sense.


As ZFS is a copy-on-write filesystem, a file being replaced is written to new blocks and then the pointer is updated to the new file once it’s finished.

This means that if you snapshot while an existing file is in the middle of being rewritten, you should expect that the old version of the file will be represented in the snapshot rather than half of the old one and half of the new one.

ZFS snapshots are taken at the block level, but they happen on the blocks currently indicated by the file system.


Even with that, the same problem exists at a higher level, since as far as I understand it, Caddy replaces/updates other files besides the certificate as part of the renewal process, including the private key. All of those files would need to be updated atomically as a group for things to work properly.

So while a snapshot may not contain a partial file, it may contain a new certificate but an old private key for example, something that I would like to avoid.

It is true that you could end up with some files updated and not others, but I have to question whether that’s really a problem.

The philosophy of Caddy’s approach to certificate management is that certs are not necessarily something that should be backed up, or relied on to have been backed up. When you start Caddy, it will maintain its certificate store, and as part of that it’ll requisition new certificates as required. The goal of this approach is to treat certificates as short-lived and as disposable as possible.

Also - these renewals happen once every 60 days or so apiece if everything goes as planned (Caddy attempts renewal once a certificate has fewer than 30 days of validity remaining, if I’m not mistaken). What exactly is the scope of your managed certificates, and how frequent is your snapshot cadence, that you’re worried you’ll intersect the middle of certificate maintenance for a given certificate?

I’d argue that if you have a very fast snapshot cadence (minutes, or even hours), you’ll have relevant snapshots of all the required files on the odd chance that you do intersect cert maintenance. If you have a moderate or relaxed snapshot cadence, the chances of this occurring are so minuscule as to not be worth engineering a solution for. The window we’re talking about must surely be sub-second - it’s just writing a few small files. And again, on the odd chance you do hit it, you can revert the certificates to the older consistent snapshot and simply have Caddy renew them all again automatically for you, so even then the impact of this very unlikely problem is minimal.

I’m open to learning of a use case that changes my understanding on that, but I’m finding it hard to imagine one.


On top of all the above, I’m not really sure how Caddy could reliably achieve simultaneous atomicity of multiple files in a filesystem.

A write-and-rename is the usual method of achieving an atomic update of an existing file, but write-and-rename isn’t atomic for multiple files on any filesystem I can think of. ZFS technically does this behind the scenes already as described above, as a COW filesystem - even when you write “directly” to the file - but obviously still isn’t atomic for multiple writes.
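
To illustrate what I mean by write-and-rename, here’s a generic sketch in Go (not Caddy’s actual storage code): the new contents go to a temporary file in the same directory, which is then renamed over the original.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// writeAtomic replaces path with data using the write-and-rename pattern:
// write a temp file in the same directory, flush it, then rename it over
// the original. The rename is atomic for this one file only.
func writeAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleanup; harmless if the rename already moved it

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // make sure the bytes hit disk before the swap
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path) // atomic replacement of the single file
}

func main() {
	// Writing a cert and its key this way is two separate atomic renames,
	// not one atomic transaction covering both files.
	if err := writeAtomic("example.crt", []byte("new cert")); err != nil {
		log.Fatal(err)
	}
	if err := writeAtomic("example.key", []byte("new key")); err != nil {
		log.Fatal(err)
	}
}
```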

No, the more I think about it, the more I think that if this is a significant issue you need to not use the filesystem for certificate storage at all.

You should probably use one of the other storage providers and have that alternate provider (e.g. redis dump.rdb, postgres pg_dump, etc) deposit its data on the filesystem in a single transaction, as part of your snapshot script, to ensure full certificate store consistency.
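
For example (purely hypothetical - the module name and fields below are guesses at what a typical third-party Redis storage plugin accepts, so check the actual plugin’s docs), the top-level storage config would replace the default file_system module with something along these lines:

```json
{
  "storage": {
    "module": "redis",
    "address": "localhost:6379"
  }
}
```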


Generally speaking, the way I’ve solved that kind of problem in the past is by swapping “old” and “new” directories. That gives you a simple transaction-like setup on the filesystem.
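
Roughly this idea, sketched in Go (generic and not Caddy-specific - the directory names are made up): build the new state in a fresh directory, then atomically repoint a symlink at it, so readers only ever see a complete old copy or a complete new copy.

```go
package main

import (
	"log"
	"os"
)

// swap repoints the "current" symlink at newDir in one atomic step.
// Readers following "current" see either the entire old directory or the
// entire new one, never a mix of the two.
func swap(current, newDir string) error {
	tmpLink := current + ".tmp"
	os.Remove(tmpLink) // discard any leftover link from a failed previous run
	if err := os.Symlink(newDir, tmpLink); err != nil {
		return err
	}
	// rename(2) atomically replaces an existing symlink at the destination.
	return os.Rename(tmpLink, current)
}

func main() {
	// Hypothetical layout: ./state-new was just fully written out,
	// and ./state-current is the symlink everything reads from.
	if err := swap("state-current", "state-new"); err != nil {
		log.Fatal(err)
	}
}
```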

I’ve actually looked into that already, and the only relevant possibility would be the MySQL storage adapter (everything else would require installing additional software just to solve this one little problem). However, to be honest, after briefly looking at its code, I’m not sure I would trust it. I know zero Go, so it’s not feasible for me to write my own storage adapter (which I would otherwise gladly do in a situation like this).


I really don’t think you need to worry about this. Even if you did lose your certs/keys, just start Caddy up again and it’ll issue new ones. It’s not a problem. The only concern is Let’s Encrypt or ZeroSSL rate limits, but you’re probably well within those limits (they’re quite generous).


After some more thinking on this, I think I will just (gracefully) shut down Caddy and replace it with a simple “maintenance page” server while the snapshot is happening (since Caddy is only reverse proxying to a single server on localhost). As far as I know that should ensure everything on disk is always in a consistent state.

I dunno, that seems like a weird level of paranoia about all this. You’d really rather have downtime? I still don’t understand what actual risk you think there is here. Like I said, even if your certs were lost and the backups happened to be corrupted (which I give like a 0.0000001% chance of both happening at the same time), you can just start Caddy back up with empty storage to reissue all the certs for free, with no harm done. There’s really no reason for any downtime.


Yes, I’d rather have very minimal “downtime” every night than any additional, unnecessary risk of ever having to manually intervene in case of a snapshot rollback. Neither solution is ideal; I’m just choosing what I see as the better of the two options.

I think the ideal solution would be to have a way to disable Caddy’s automatic renewal checking and be able to manually and externally trigger the check and wait for it to finish (e.g. via Caddy’s admin API). So until such a feature exists, I will have to settle for the solution I’ve chosen.

I appreciate everyone’s input on the matter.

I don’t see why you’d need manual intervention. Issuance is fully automated.


I was referring to the situation when rolling back to a snapshot with inconsistent state/certificate files. Besides detecting the situation, I would need to (as you said) manually empty storage and restart Caddy.

On second thought, this might be something the events subsystem can help with. It looks like there are events emitted by certmagic that are passed on by Caddy before and after a renewal is attempted, so I could just have an event listener for each that executes a command in the foreground.

This would allow me to block on the cert_obtaining event until snapshotting is finished, if one is in progress. It would also allow me to touch a file that gets removed on cert_obtained/cert_failed, which can be checked for and waited on before starting a snapshot.
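
Something roughly like this is what I have in mind (a sketch only - it assumes a third-party exec event handler such as caddy-events-exec, and both the handler’s fields and my script paths are placeholders): the first script would wait out any in-progress snapshot and touch the “renewal in progress” flag, and the second would remove it.

```json
{
  "apps": {
    "events": {
      "subscriptions": [
        {
          "events": ["cert_obtaining"],
          "handlers": [
            { "handler": "exec", "command": "/usr/local/bin/wait-for-snapshot.sh" }
          ]
        },
        {
          "events": ["cert_obtained", "cert_failed"],
          "handlers": [
            { "handler": "exec", "command": "/usr/local/bin/clear-renewal-flag.sh" }
          ]
        }
      ]
    }
  }
}
```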

Can anyone think of any caveats (besides events being marked as experimental) or problems offhand with doing something like that?

You’re really way over-engineering this. This is a problem that you think exists, but it’s so infinitesimally unlikely to occur in practice that it’s not worth worrying about.
