Caddy 2.7.0-beta.1

1. The problem I’m having:

Updating from 2.6.4 to 2.7.0-beta.1.
Attempting the upgrade on our setup results in SSL errors for all the domains that were previously working like a charm. Reverting to 2.6.4 fixes the issue.
Basically it looks like caddy does not find the existing certificates and tries to obtain new certificates for all domains.
There are maybe around 3,000 domain certificates on the server, so I guess it would take ages to get a new cert for each (rate limiting, etc.).

May I ask if this is supposed to happen? For example, maybe there were changes in how certificates are stored/indexed, and 2.7.x can't load existing certs from 2.6.4?

Thanks for the info!

2. Error messages and/or full log output:

Nothing special in the log, except that it seems to request a NEW certificate for every previously existing domain. There is also some HTTP/3 stuff, but I'm not sure it's related. For example:

{"level":"error","ts":1687146821.2925706,"logger":"http.log","msg":"setting HTTP/3 Alt-Svc header","error":"no port can be announced, specify it explicitly using Server.Port or Server.Addr"}

3. Caddy version:

v2.7.0-beta.1 h1:hKYXjAR/7Tn/DVfsu9j1ER8O1qLHh6163a7RoStRBXI=

4. How I installed and ran Caddy:

Homemade RPM package for our platform

a. System environment:

CloudLinux 8.x (AlmaLinux)

b. Command:

Started with systemd

c. Service/unit/compose file:

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile --force
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target

d. My complete Caddy config:

{
        admin 127.0.0.1:8888
        default_bind 127.0.0.1 [::1] 10.111.20.10 [fdaa:beef:b00b:85::20:10]
        grace_period 3s
        log {
                output file /var/log/caddy/caddy.log {
                        roll_size 250MiB
                        roll_keep_for 15d
                }
                level ERROR
        }
        email letsencrypt@youwishyouknow.com
        acme_dns rfc2136 {
                key_name "dev.youwishyouknow.com"
                key_alg "hmac-sha512"
                key "crapkey"
                server "83.X.158.X:53"
        }
        on_demand_tls {
                ask https://api.youwishyouknow.com/caddy
                interval 2m
                burst 5
        }
        servers {
                trusted_proxies cloudflare {
                        interval 12h
                        timeout 15s
                }
        }
}

# Common options we want to apply to every "virtualhost"
(common) {
        @sc_server_fqdn {
                path /_sc_get_server_fqdn
        }
        respond @sc_server_fqdn "dev.youwishyouknow.com" 200 {
                close
        }
        reverse_proxy http://127.0.0.80:80
}

# Default catchall endpoints
http:// {
        import common
}
https:// {
        import common
        tls {
                on_demand
                load /etc/caddy/certs
        }
}

# Hostname endpoint
http://dev.youwishyouknow.com {
        redir https://{host}{uri}
}
https://dev.youwishyouknow.com {
        # Imunify AV+ access restriction
        @imav_access {
                path /imav*
                not remote_ip 192.168.50.0/24 10.111.0.4
        }
        route @imav_access {
                respond "We’re sorry, but this resource is not available for you. If you feel this is an error, please contact your amazing server administrator." 403 {
                        close
                }
        }
        import common
}

# LVE Manager endpoint
http://manager.dev.youwishyouknow.com {
        redir https://{host}{uri}
}
https://manager.dev.youwishyouknow.com {
        @manager_access {
                not remote_ip 192.168.50.0/24 10.111.0.4
        }
        route @manager_access {
                respond "We’re sorry, but this resource is not available for you. If you feel this is an error, please contact your amazing server administrator." 403 {
                        close
                }
        }
        reverse_proxy http://127.0.0.1:9000
}

# IP endpoints
http://127.0.0.1, http://[::1], http://10.111.20.10, http://[fdaa:beef:b00b:85::20:10] {
        import common
}
https://127.0.0.1, https://[::1], https://10.111.20.10, https://[fdaa:beef:b00b:85::20:10] {
        import common
        tls internal
}

# Per virtualhost specific configs
import /etc/caddy/customers/*.conf

5. Links to relevant resources:

Thanks for trying the beta!

Did anything else about your setup change? Even unknowingly? Maybe permissions of the disk or anything related to the file system or service/user or anything else external to Caddy?

I will have a hard time reproducing this with the given config since it requires a few plugins and some config files that aren’t provided here. So I’ll need your help to narrow it down…

The full unredacted logs and unredacted configs will also help (this is a forum rule). Without that I don’t think there’s anything we can do beyond guessing.

If you reduce your config down to as minimal as possible that will help us troubleshoot faster. Especially for the HTTP/3 error.

Hello Matt,

Thanks a lot for your reply. I think I will need to reproduce it here on the production machine. To avoid too much trouble for the users, I will schedule this overnight, something like:

  • Stop caddy
  • Clear the current log
  • Switch to 2.7.0 (with log level set to debug and a minimal config)

And let it run for a few minutes, and grab the resulting logs…

What I would expect is that it starts answering SSL requests right away (for already-existing certificates).

What I saw in my last attempt was that it started requesting new certs for every domain and hit the rate limit rather quickly, so almost no domain was answering SSL queries.

Let’s see if this happens again in my next attempt. Will let you know.

Kind regards

We just released beta 2 yesterday. You can try that.

It really sounds like the storage/disk isn’t finding the existing certificates. Possibly misconfigured somehow?

I will rebuild the package with beta 2 and give it a try. Thanks.

Regarding a possible misconfiguration of the certificate storage: I do not have any specific configuration for the certificate storage, and we are starting caddy with the exact same config.
But I guess the next attempt with debug log level could give a hint about what is going on.

Kind regards


Hello Matt, Francis,

While comparing the 2.6.4 and 2.7.0-beta2 logs from the (failed) attempt last night, it looks to me like 2.7.0, for on-demand TLS, now performs some new checks, such as calling the ASK endpoint and incrementing a rate-limit counter, even before loading an existing certificate from storage.

It does this for every cert it wants to load when the first request for that domain is received.

For the first few domains this goes well, but then it quickly hits an internal on-demand rate limit:

Logs from the first certificates being loaded:

{"level":"debug","ts":1687474831.5856535,"logger":"http.handlers.reverse_proxy","msg":"upstream roundtrip","upstream":"127.0.0.80:80","duration":0.604641736,"request":{"remote_ip":"190.226.134.246","remote_port":"55546","client_ip":"190.226.134.246","proto":"HTTP/1.1","method":"GET","host":"www.divespiritfakarava.com","uri":"/","headers":{"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"],"Referer":["https://www.google.com/"],"Accept-Encoding":["gzip, deflate"],"Upgrade-Insecure-Requests":["1"],"User-Agent":["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"],"X-Forwarded-Proto":["http"],"Accept-Language":["en-GB,en;q=0.9,es;q=0.8,gl;q=0.7"],"X-Forwarded-For":["190.226.134.246"],"X-Forwarded-Host":["www.divespiritfakarava.com"]}},"headers":{"Cache-Control":["no-store, no-cache, must-revalidate, post-check=0, pre-check=0"],"Pragma":["no-cache"],"Set-Cookie":[],"Content-Length":["0"],"Content-Type":["text/html; charset=utf-8"],"Date":["Thu, 22 Jun 2023 23:00:30 GMT"],"Server":["Apache/2.4.37 () Phusion_Passenger/6.0.14"],"X-Powered-By":["PHP/8.2.5"],"Expires":["Wed, 17 Aug 2005 00:00:00 GMT"],"Location":["https://www.divespiritfakarava.com/"],"Last-Modified":["Thu, 22 Jun 2023 23:00:31 GMT"]},"status":301}
{"level":"debug","ts":1687474831.5881243,"logger":"tls","msg":"response from ask endpoint","domain":"www.cybermind.ch","url":"https://api.swisscenter.com/webservices.php/caddy/dnslookup?domain=www.cybermind.ch","status":200}
{"level":"debug","ts":1687474831.5881457,"logger":"tls.handshake","msg":"all external certificate managers yielded no certificates and no errors","remote_ip":"40.77.167.206","remote_port":"58201","sni":"www.cybermind.ch"}
{"level":"debug","ts":1687474831.5883646,"logger":"tls","msg":"loading managed certificate","domain":"www.cybermind.ch","expiration":1690876585,"issuer_key":"acme-v02.api.letsencrypt.org-directory","storage":"FileStorage:/var/lib/caddy/.local/share/caddy"}
{"level":"debug","ts":1687474831.5885167,"logger":"tls.cache","msg":"added certificate to cache","subjects":["www.cybermind.ch"],"expiration":1690876585,"managed":true,"issuer_key":"acme-v02.api.letsencrypt.org-directory","hash":"d2cf8d10538aacb4eb68619f395a0afd1240bfbb17601517bf24068650e0a745","cache_size":15,"cache_capacity":10000}
{"level":"debug","ts":1687474831.5885346,"logger":"events","msg":"event","name":"cached_managed_cert","id":"1a5149d3-f0c5-49ad-b341-4d30f130c11b","origin":"tls","data":{"sans":["www.cybermind.ch"]}}
{"level":"debug","ts":1687474831.5885465,"logger":"tls.handshake","msg":"loaded certificate from storage","remote_ip":"40.77.167.206","remote_port":"58201","subjects":["www.cybermind.ch"],"managed":true,"expiration":1690876585,"hash":"d2cf8d10538aacb4eb68619f395a0afd1240bfbb17601517bf24068650e0a745"}

Then after a few seconds it reaches the limit:

{"level":"debug","ts":1687474833.0693004,"logger":"tls","msg":"response from ask endpoint","domain":"swissnow.ch","url":"https://api.swisscenter.com/webservices.php/caddy/dnslookup?domain=swissnow.ch","status":200}
{"level":"debug","ts":1687474833.0694168,"logger":"http.stdlib","msg":"http: TLS handshake error from 94.103.96.129:51869: certificate is not allowed for server name swissnow.ch: decision func: on-demand rate limit exceeded"}

I was able to confirm and work around the issue by raising the on-demand "burst" value to 10000 in the Caddyfile so it doesn't hit the limit.
I guess having to do this is not a good idea, though.
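
For reference, the workaround is simply the same on_demand_tls block as in my config above, with only the burst value raised:

on_demand_tls {
        ask https://api.youwishyouknow.com/caddy
        interval 2m
        burst 10000
}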

It looks like a change in 2.7.0 makes this rate-limit counter increase every time a cert is loaded through the on-demand "routine", even when the cert already exists in storage.

After 5 minutes of uptime, the ASK endpoint had been called 466 times (out of ~3,000 possible domains).

At first I thought it was trying to issue a new certificate in each of these cases, but no issuance request is made. After the call to ASK (and if the rate limit has not been reached), the certificate is finally loaded from storage.

If this is the expected behavior, I can think of some possible side effects (even when not hitting the limit):

  • If too much pressure is applied to the ASK endpoint when (re)starting caddy and it starts erroneously answering with something other than HTTP 200, could the certificate never load, or worse, be removed from local storage?
  • If the ASK endpoint takes, for example, 1-2 seconds to answer and a lot of requests are made when (re)starting caddy, would this delay the loading of these existing certificates?

Kind regards.

PS: I can send you unredacted raw logs privately if needed. But really, I would rather not post them publicly on the forum. You know, GDPR and other stuff, like scammers scraping websites/forums to gather lists of domains for their dirty biz :confused:

PS2: Other than that, 2.7.0-beta2 seems great. I replaced the “realip” module we were using with the new “trusted_proxies cloudflare”. Works like a charm! Congrats and thanks a lot for this amazing server.


Ah, yes, that is a new change. The idea is to reduce I/O and CPU time: if a certificate isn’t even allowed for a domain, don’t load it from storage in the first place. Before 2.7, we only performed the “ask” check just before trying to obtain a new (or renew an existing) cert. This resulted in lots of unnecessary I/O and CPU costs (loading and decoding the useless certificate). Now we short-circuit all that.

I hadn’t considered the effect it would have on that throttle, however. I wonder if the throttle should be ignored if there’s an “ask” endpoint set. (And even deprecated, since now “ask” is required.) Hmm.

Hopefully the ask endpoint is giving correct responses. I don’t know what else to do about that, really. If the certificate sits in storage for a long, long time and expires, it will be removed some time after it expires.

I don’t think so. I did re-jigger the synchronization here for 2.7, where the first request in those 1-2 seconds should obtain a lock, check storage (including the “ask”), while all the others for that domain wait, and then if the first one loaded the cert, all the others will use it immediately. If the first one didn’t load the cert, then the next one will try to load it, etc. Probably need to smooth that out a bit. Do you have a real life scenario where this is happening?

I can help in private for sponsors of a sufficient tier, so that’s an option if you would like to sign up for a sponsorship! Let me know if you want guidance on which tier to choose.

Cool, glad it’s working out!

Anyway you’ve given me some things to look into. I’d still be interested in more info about the performance if the throttle is removed.

Hi Matt,

Ah, yes, that is a new change. The idea is to reduce I/O and CPU time: if a certificate isn’t even allowed for a domain, don’t load it from storage in the first place. Before 2.7, we only performed the “ask” check just before trying to obtain a new (or renew an existing) cert. This resulted in lots of unnecessary I/O and CPU costs (loading and decoding the useless certificate). Now we short-circuit all that.

Okay, I get the idea behind this change.

Still, I'm a bit worried: if for any reason the ASK endpoint is unavailable (broken, under maintenance, or whatever other reason) and in the meantime someone decides to restart caddy on some servers, then no existing on-demand certificate will be loaded, and therefore all domains will be instantly unreachable until the ASK endpoint is back up.

To me this sounds a bit like a regression in terms of being conservative and avoiding single points of failure, since this one would bring down the whole service.

Especially since the ASK endpoint is now mandatory. Before 2.7.x, one could just temporarily disable the endpoint if there was some trouble with it, so the service could at least resume with the valid certs it already had in storage and accept some new ones.

Now I guess a very dirty workaround to temporarily disable the endpoint would be to set it to any URL that always returns a 200.
It then makes little sense to make the endpoint mandatory if you can get around it simply by pointing it at any page that returns 200, instead of just disabling it…
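
Just to show how trivial such a bypass would be (a minimal sketch, not something I would actually run; the listen address and path are arbitrary), an "always allow" endpoint is basically:

package main

import (
        "log"
        "net/http"
)

func main() {
        // Answer 200 to everything, ignoring the ?domain= query parameter caddy sends,
        // which effectively disables the ASK check while still having an endpoint configured.
        http.HandleFunc("/caddy", func(w http.ResponseWriter, r *http.Request) {
                w.WriteHeader(http.StatusOK)
        })
        log.Fatal(http.ListenAndServe("127.0.0.1:9400", nil))
}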

Anyway you’ve given me some things to look into. I’d still be interested in more info about the performance if the throttle is removed.

With these new changes, removing the throttle on checking/loading existing on-demand certificates seems mandatory; that's effectively what I did (I guess) by setting the "burst" parameter to 10000 so it never hits the throttle.

Without the throttle removed, it would take more than an hour for caddy (serving many domains) to finish loading (from storage) all the certificates being requested by clients connecting to the service. In the meantime, the service is not fully accessible.

With the throttle removed, there can still be an issue if the ASK endpoint is slow to answer and is getting overloaded with requests when caddy is restarted…

A "slow" ASK endpoint should not be a problem under normal conditions, but if it's flooded with requests on a caddy restart, it can temporarily become a problem, delaying the service's return to normal.

I can help in private for sponsors of a sufficient tier, so that’s an option if you would like to sign up for a sponsorship! Let me know if you want guidance on which tier to choose.

I'm quite sure our "CEO" would not be against sponsoring a project like caddy. Feel free to tell me what options exist for this. We are a small hosting company, though :slight_smile:

Kind regards


Thanks for the follow-up!

Okay, so I just checked, and the throttle doesn’t take a “hit” until after the “ask” endpoint returns a 200:

That calls into question this observation:

Do you think we can verify that the throttle reaches capacity even when the ‘ask’ endpoint returns an error response? The logs above show that it’s returning 200, which will indeed increase the counter.

I know, but there’s not really another way to do it. Caddy has to know which certificates are allowed.

On-Demand is considered an advanced configuration, so I do expect that site owners can keep their ask endpoint online and snappy.

Not likely. We see in practice that most sites have a long tail of lesser-used certificates. Certificates at the end of that tail are the least likely to be in memory, while the most-used certificates are almost certainly already loaded from storage. As long as the server stays running, those certs should stay in memory until they need to be renewed.

That would be shooting yourself in the foot. :grimacing: (Why would you sabotage yourself?)

I do question the utility of the throttle these days, especially now that ‘ask’ is required.

But I’m still curious why it’s being depleted, unless your ask endpoint is always returning 200.

I wonder if we could find a way to keep certificates in memory through a config reload. Maybe a parameter that you set to true meaning to not reset the cache. (We have to reset it because we don’t know what the new configuration is. There’s not really a way to do a meaningful delta on them.) This way, your config reloads wouldn’t touch certificates at all and would just continue to use what is already there. I’d have to think on this though, it likely will cause problems…

That’d be great, we work with companies of all sizes! Here is a link to available tiers. All the higher tiers are customizable. Would be happy to work with your team.

Hello Matt,

I could for sure give it a try by changing the ASK url to something returning a 403.

Yes, of course, with best effort. Bad things can happen that break the ASK endpoint, for example maintenance on the database that the endpoint queries, or things like that. Hopefully it won't happen often :slight_smile:

Yes indeed, but as you said, only as long as the server stays running. Once you restart it, it will hammer the ASK endpoint with requests, since it now needs to revalidate every cert as it gets loaded.
If it's serving 10,000 domains and 30% of those domains receive a request in the first minutes after startup, that is hammering the ASK endpoint.
Fortunately, in our case the endpoint answers rather quickly, but if other people's endpoints take >1 second to answer, I guess it could be a problem.

I would do this only in case of emergency: if the caddy service needs to be restarted and the endpoint is unavailable for some reason, we need a way to temporarily skip ASK. But of course, in normal operations it MUST be enabled.

I must admit that I don’t get the reasoning here.

Why should the ASK endpoint answer anything other than 200, except when the domain is not allowed?
With 2.7.x, after a caddy restart, caddy calls the ASK endpoint for every existing cert it wants to load, and therefore for most of them the answer is 200 (allowed).

So it is fine for the very first few certs being loaded, and then it hits the limit and stops loading certs until the rate limit expires.

Maybe I'm getting something wrong here?

Well, that would improve things a bit, but it still wouldn't help in cases where a full caddy restart is done (full server reboot, possibly a process crash, etc.).

Thanks for the link, I’ll take a look at it with the boss :slight_smile:

Kind regards.

@r00tsh3ll Would you be able to give this PR a try?

It caches the ‘ask’ result for about an hour, so it should significantly reduce load on your ‘ask’ endpoint.
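
Conceptually it just memoizes the allow/deny decision per domain with a TTL, something like the sketch below (illustrative only, with made-up names; not the actual PR code):

package askcache

import (
        "sync"
        "time"
)

type decision struct {
        allowed bool
        expires time.Time
}

// Cache memoizes allow/deny answers from the 'ask' endpoint for a fixed TTL.
type Cache struct {
        mu      sync.Mutex
        ttl     time.Duration
        entries map[string]decision
}

func New(ttl time.Duration) *Cache {
        return &Cache{ttl: ttl, entries: make(map[string]decision)}
}

// Allowed returns the cached decision for domain, calling check (the HTTP request
// to the 'ask' endpoint) only when the entry is missing or expired.
// Note: a real implementation would avoid holding the lock during the HTTP call.
func (c *Cache) Allowed(domain string, check func(string) bool) bool {
        c.mu.Lock()
        defer c.mu.Unlock()
        if d, ok := c.entries[domain]; ok && time.Now().Before(d.expires) {
                return d.allowed
        }
        allowed := check(domain)
        c.entries[domain] = decision{allowed: allowed, expires: time.Now().Add(c.ttl)}
        return allowed
}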

Hello Matt,

Thank you for following up.

I can of course give the PR a try, and I will do so.

However, I must admit I don't see how it would help with the original issue I posted here.

Caching ASK results for an hour for the various on-demand cert requests makes sense when the ASK endpoint returns a non-2XX for those requests (deny).
In that case it would indeed avoid hammering the ASK endpoint with retries for the same domain.

But the problem with the changes in 2.7.x is mainly the hundreds of ASK queries made by caddy after a restart/reload for every already-existing certificate it has in local storage.

If I'm getting it right, the on-demand cert loading flow before 2.7.x was:

  • HTTPS request received by caddy for domain.tld
    • If cert is in memory cache
      • Serve request
    • If it’s not in memory cache,
      • If corresponding cert is in the storage,
        • Load it, add it to memory cache
          • Serve request
      • If NO matching certificate is in storage
        • If on_demand ratelimit counter reached
          • Abort, reschedule the attempt, and return an SSL error to the browser
        • If on_demand ratelimit counter NOT reached
          • Call ASK endpoint to check if we can REQUEST a certificate
            • If allowed
              • Increment ratelimit counter
              • Request a certificate through corresponding issuers. Then save it to storage, load it in cache and use it.
                • Serve request
            • If not allowed
              • Return an SSL error to the browser

So here, after a caddy start/restart/reload, most of the needed certs will be loaded directly from storage and added to the cache when the first request matching their SNI comes in, which is great.

However, with 2.7.x it now looks like this (a rough code sketch follows the list):

  • HTTPS request received by caddy for domain.tld
    • If cert is in memory cache
      • Serve request
    • If NOT in memory cache
      • If on_demand ratelimit counter reached
        • Abort, reschedule the attempt, and return an SSL error to the browser
      • If on_demand ratelimit counter NOT reached
        • Call ASK endpoint to check if we can LOAD EXISTING or REQUEST a certificate
          • If allowed
            • Increment ratelimit counter
            • If a corresponding cert is in the storage
              • Load it, add it to memory cache, then use it
                • Serve request
            • If NO corresponding cert is in the storage
              • Request a certificate through the corresponding issuers
                • Save it to storage, load it in cache and use it.
                  • Serve request
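
In Go-flavored pseudocode, this is how I understand the 2.7.x path (all names are made up; this is not actual Caddy/CertMagic code):

package sketch

import "errors"

// onDemand models my understanding of the 2.7.x on-demand lookup; the function
// fields stand in for caddy's internals and every name here is made up.
type onDemand struct {
        cacheGet    func(string) ([]byte, bool) // in-memory certificate cache
        rateReached func() bool                 // on-demand throttle check
        askAllows   func(string) bool           // HTTP call to the ASK endpoint
        takeToken   func()                      // increments the throttle counter
        storageLoad func(string) ([]byte, bool) // certificate storage lookup
        issue       func(string) ([]byte, error)
        cacheAdd    func(string, []byte)
        storageSave func(string, []byte)
}

func (o onDemand) getCertificate(sni string) ([]byte, error) {
        if cert, ok := o.cacheGet(sni); ok {
                return cert, nil // already in memory: no ASK call, no throttle hit
        }
        if o.rateReached() {
                return nil, errors.New("on-demand rate limit exceeded") // SSL error to the client
        }
        if !o.askAllows(sni) { // called even when the cert already sits in storage
                return nil, errors.New("domain not allowed")
        }
        o.takeToken() // the counter increases for storage loads as well as new issuances
        if cert, ok := o.storageLoad(sni); ok {
                o.cacheAdd(sni, cert)
                return cert, nil
        }
        cert, err := o.issue(sni) // new ACME issuance via the configured issuers
        if err != nil {
                return nil, err
        }
        o.storageSave(sni, cert)
        o.cacheAdd(sni, cert)
        return cert, nil
}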

The main issue here, then, is that when starting/restarting/reloading caddy it will:

  • Call the ASK endpoint before loading any existing certs from storage into memory. On a server serving thousands of domains, this generates a burst of ASK requests on startup.
  • For most of these requests the ASK endpoint will return 200, as they are mostly legitimate requests for which certificates already exist.
  • As I think you told me, the on-demand rate-limit counter is increased every time the ASK endpoint returns 200, so it quickly hits the rate limit and prevents other certs from being loaded right away, and therefore throws SSL errors to clients until caddy has managed to load all the needed certificates.

To be honest, I'm not really sure that calling the ASK endpoint before loading a certificate that already exists in storage (but not in memory) is the best idea.
I understand the goal was to reduce I/O, but it looks to me like it generates more trouble than benefit (especially since storage nowadays is quite fast…).

I’m sorry if I am not good at explaining the issue. I’m trying hard though.

Kind regards


Thank you for the detailed writeup! Let me do my best to reply and we’ll figure this out.

In 2.7, the ask endpoint is checked first. The rate limiter only applies if the endpoint returns 200.

However, I can see now how that could be a problem: the name is approved, then the reservation is made in the rate limiter, then the certificate may be loaded from storage. Obviously, rate-limiting the loading of certificates from storage is not great :sweat_smile: I'll see what I can do about that…

Ok, yes, now we’re on the same page.

So, the main issue is the rate limiter gets in the way, yeah?

Since ‘ask’ is now mandatory, I wonder if we should simply deprecate the rate limiter.

Yeah, no you’re right – we’ll get this fixed before 2.7. This is why we tag betas :upside_down_face:

@r00tsh3ll I’ve drafted a PR which drastically lightens the config reloads, so that the certificate cache isn’t purged each time:

Since the cache won’t be emptied as part of a config change, CertMagic will find the cert in the cache and won’t need to hit storage OR ‘ask’. Should just keep humming along.

Please give it a shot if you can :blush:

Hello Matt,

Sorry for the late reply; I was a bit busy with other projects over the past few days.

Thank you for the info and the PR. I will give it a try ASAP!

Kind regards


Hello @matt,

I finally had some free time to test this change, and I can confirm that after a reload, caddy now uses the certificates in the cache instead of loading them again from storage.

That's a great improvement, thanks :+1:

Kind regards


Oh yay!! Thanks for trying it out and I’m glad we were able to find a good solution. :100:
