Too many failed authorization attempt errors

We are getting the error message “too many failed attempts”.
By the way, I am not asking why this happens.

A

I wonder if Caddy already has, or will have, configuration keys to limit the number of attempts. Here are some ideas for possible keys:

letsencrypt_request_attempts
letsencrypt_renew_attempts
letsencrypt_admin_email: when the maximum number of attempts has failed, the admin gets an email notification
letsencrypt_admin_webhook_url: on failure, POST the error and debug info to this URL

B

I wonder if Caddy uses, or will use, the default letsencrypt folder path, i.e. /etc/letsencrypt/*
(this would allow an admin to debug manually with Let’s Encrypt tooling directly in case Caddy runs into a related issue)

A:

Ultimately this depends on your Caddy configuration. Caddy has sane, limited retries in place, but these are CA-agnostic, and you can still hit Let’s Encrypt’s rate limits if you hit Caddy with a hammer.

B:

There is actually no “default letsencrypt folder path”, and besides, Caddy is CA-agnostic. Where certificates are stored is entirely up to the client. As Caddy’s docs say, they’re stored in the folder given by XDG_DATA_HOME or, if that is not set, $HOME/.caddy (or $HOME/.local/share/caddy in Caddy 2).
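For anyone who wants to locate the files by hand, here is a minimal sketch of the lookup rule described above for Caddy 2 on Linux. The function name `caddy2_data_dir` is hypothetical, and the `caddy` subfolder under XDG_DATA_HOME is my understanding of Caddy 2's behavior rather than something stated in this thread, so please verify it against your own install.

```python
from pathlib import PurePosixPath


def caddy2_data_dir(env):
    """Resolve the Caddy 2 data folder from a mapping of environment variables.

    Assumption: when XDG_DATA_HOME is set, Caddy 2 uses a "caddy" subfolder
    inside it; otherwise it falls back to $HOME/.local/share/caddy.
    """
    xdg = env.get("XDG_DATA_HOME")
    if xdg:
        return PurePosixPath(xdg) / "caddy"
    return PurePosixPath(env["HOME"]) / ".local" / "share" / "caddy"
```

Passing `os.environ` as `env` resolves the path for the current user; remember that a service run by systemd or Docker may have a different HOME than your login shell.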

A.
Regarding the “sane number”: unless you tell me it is less than 2, anything above that can be subjective. We restarted Caddy only twice. The second time, we saw “Too many failed attempts”. In our test case, the IP is not pointing to the host, so we do expect it to fail.

If Caddy has a sane number, the logical conclusion is that there is a bug (or bugs) in the code.
After we got the “Too Many Failed Attempts” error from Caddy, we were able to use certbot to acquire certs for the same domain on a different host with a properly configured A record.

I’m just curious whether there is any plan to add a max attempt count config key. If a user doesn’t specify it, Caddy can use the default “sane number” the Caddy team picks, but being able to set it could be helpful, or at least give peace of mind, for those who want to.

B.
What you said is reasonable.
I intended to say “conventional path” for a specific distro, but I can see how that may complicate things. Thank you for the pointer on customizing the data folder.

I believe this is a non sequitur. Caddy can have sane rate limits while a third-party ACME service provider imposes tight rate limits of its own, and you can run into those without any buggy code. “Sane” doesn’t mean “will always prevent you from running into any arbitrary third-party rate limits”.

Note the published documentation regarding LetsEncrypt’s own imposed rate limits, particularly:

There is a Failed Validation limit of 5 failures per account, per hostname, per hour. This limit is higher on our staging environment, so you can use that environment to debug connectivity problems.
https://letsencrypt.org/docs/rate-limits/
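To make the scope of that limit concrete, here is a small sketch that counts failures per (account, hostname) pair over one hour. It is only a model of the quoted sentence: the class name `FailedValidationWindow` is hypothetical, and treating the hour as a sliding window is an assumption, since the real server may count differently.

```python
from collections import defaultdict, deque

LIMIT = 5      # failures allowed, taken from the quoted documentation
WINDOW = 3600  # one hour, in seconds


class FailedValidationWindow:
    """Track failed validations per (account, hostname) pair.

    Assumption: the hour is modeled as a sliding window of timestamps.
    """

    def __init__(self, limit=LIMIT, window=WINDOW):
        self.limit = limit
        self.window = window
        self._failures = defaultdict(deque)

    def _prune(self, key, now):
        # Drop failures that are a full window old or older.
        queue = self._failures[key]
        while queue and now - queue[0] >= self.window:
            queue.popleft()
        return queue

    def record_failure(self, account, hostname, now):
        self._prune((account, hostname), now).append(now)

    def is_blocked(self, account, hostname, now):
        return len(self._prune((account, hostname), now)) >= self.limit
```

Because the key includes the account, five failures recorded for one account on a hostname leave a different account on the same hostname unaffected, and the block lifts once the old failures age out of the window.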

Unless you copied the LetsEncrypt ACME account created by Caddy to your second host for Certbot to use, it’s likely the request was allowed there because it came from a different ACME account.

I note also that if you expected the request to fail, LetsEncrypt points users to its own staging environment for testing and integration purposes. In Caddy v1, you can use a CLI flag to make use of this: -ca https://acme-staging-v02.api.letsencrypt.org/directory. In v2, this is done while configuring the ACME module.

Matt has been working on rate limit/throttling pretty recently, particularly with respect to the New Orders Per Account Per 3 Hours rate limit (see: https://github.com/mholt/certmagic/commit/035d803533d024a05369cec5ab49830dfdf5553b). But 5 failures in an hour is a tight limit to work with, especially when Caddy is designed to try HTTP and TLS-ALPN challenges and fall back to the other in the event of a failure, for redundancy.
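A rough back-of-the-envelope sketch of why 5 per hour is tight when two challenge types are tried. This is illustrative arithmetic only, not Caddy's actual retry logic: `simulate_starts` is a hypothetical helper, and it assumes every failed challenge counts as one failed validation and that each start tries each challenge type exactly once with no internal retries.

```python
def simulate_starts(starts, limit=5, challenge_types=("http-01", "tls-alpn-01")):
    """Return a log of (start_number, challenge, outcome) tuples.

    Assumptions: one try per challenge type per start, each failure uses up
    one unit of the hourly budget, and all starts fall within the same hour.
    """
    failures = 0
    log = []
    for start in range(1, starts + 1):
        for challenge in challenge_types:
            if failures >= limit:
                # Budget exhausted: the CA refuses before validating.
                log.append((start, challenge, "rate limited"))
                continue
            failures += 1
            log.append((start, challenge, "validation failed"))
    return log


for entry in simulate_starts(3):
    print(entry)
```

Under these assumptions, a misconfigured host uses up the whole budget partway through its third start, so a couple of restarts within the hour is enough to reach the limit.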

As for giving the ability to configure those numbers… Well, there’s some control over that in the TLS automation settings in Caddy v2: https://github.com/caddyserver/caddy/wiki/v2:-Documentation#tlsautomation. That might be a place to make more controls over throttling available moving forward.

@Whitestrake + @matt

Thank you very much for the detailed clarification!

Given provider-imposed limits, I can see why throttling settings may not be so useful.
Would it be helpful if Caddy had some logic to verify that the host is “reachable” at the configured domain before attempting the Letsencrypt process? (Users should ensure this themselves, but mistakes happen.)

There are so many variables here that in short, the only way to ensure your host is reachable by LetsEncrypt is to be LetsEncrypt and try to connect. Other than that, only trying an ACME challenge is going to tell you definitively whether an ACME challenge will succeed or not.

Everything else is too affected by the local environment for any client-side automatic testing to be reliable, even if there were multiple indicators that could be used in concert to determine confidence. Matt’s written a bit about this recently in another thread, too:

And there’s also this draft for industry standard ACME client best practices that warns against relying on this kind of checking: https://github.com/https-dev/docs/blob/master/acme-ops.md#do-not-rely-on-pre-validation-checks

Essentially - you find me some logic that might be helpful, and I can probably find you a use case that renders that logic invalid (and possibly counterproductive).

@Whitestrake + @matt

It’s very interesting to read what you wrote about the details and nuances. I think I have a better appreciation of the situation now. Thank you for answering questions and for all the work you have done!

I’m probably going to overlap some of what @Whitestrake already said (which I haven’t had time to read fully), but I started drafting this response days ago and was too busy to finish it until now, sorry about that.

Welllll… that’s wrong. :) This is jumping to a conclusion and disregards many of the external factors that really have nothing to do with Caddy. I believe you definitely had that experience, but I am not convinced that it was caused by the reasons you think.

That’s possible, but unlikely, especially since:

There are a lot of reasons that this could happen, and such an obvious bug in Caddy’s cert management logic (after 4 years of continuous production use, managing millions of certificates) is toward the bottom of the list.

Can you please post the full logs and error messages (“too many failed attempts” is not a complete error message) from all your trials to get certificates for your domain name on any/all of your hosts in the relevant time period, including those from other ACME clients such as Certbot? That will help us come to an understanding of what really happened. Right now we’re just guessing. You seem convinced it is because of a bug in Caddy, but I need to be convinced too if I am going to be able to fix it; it’s impossible otherwise.

Please also post your full and complete (unredacted) Caddyfile, along with how you executed Caddy (so any systemd unit files, Dockerfiles, whatever else is relevant). This is also crucial in knowing where the problem lies.

Caddy already has this, and it depends how you run Caddy and how you configure its TLS, which is why I’m asking for the config so we can solve the problem if there is one within Caddy.

This is impossible to do without an external vantage point, and even then it’s no guarantee that the CA’s production servers would reach the same conclusion. I think Matthew linked to a document that explains this in more depth.