ACME auto-ssl suddenly stopped working

Hi,

We are running a setup that uses the automatic SSL provisioning for a few thousand domains. Last night with no alterations to config, it seems it stops attempting to initiate a challenge.

The log just shows “Obtaining bundled SAN certificate” over and over again for a lot of different domains, with no further details.

I’ve tried to enable a higher verbosity but couldn’t find such an option.

How could we proceed to debug this issue?

2019/08/29 06:31:53 [INFO] [a.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:31:53 [INFO] [a.example] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/[redacted]
2019/08/29 06:31:53 [INFO] [a.example] acme: Could not find solver for: tls-alpn-01
2019/08/29 06:31:53 [INFO] [a.example] acme: use http-01 solver
2019/08/29 06:31:53 [INFO] [a.example] acme: Trying to solve HTTP-01
2019/08/29 06:31:59 [INFO] [a.example] The server validated our request
2019/08/29 06:31:59 [INFO] [a.example] acme: Validations succeeded; requesting certificates
2019/08/29 06:43:00 [INFO] [a.example] Server responded with a certificate.
2019/08/29 06:43:00 [INFO] [b.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:00 [INFO] [b.example] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/[redacted]
2019/08/29 06:43:00 [INFO] [b.example] acme: use tls-alpn-01 solver
2019/08/29 06:43:00 [INFO] [b.example] acme: Trying to solve TLS-ALPN-01
2019/08/29 06:43:06 [INFO] [b.example] The server validated our request
2019/08/29 06:43:06 [INFO] [b.example] acme: Validations succeeded; requesting certificates
2019/08/29 06:43:07 [INFO] [b.example] Server responded with a certificate.
2019/08/29 06:43:07 [INFO] [c.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:07 [INFO] [c.example] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/[redacted]
2019/08/29 06:43:07 [INFO] [c.example] acme: use tls-alpn-01 solver
2019/08/29 06:43:07 [INFO] [c.example] acme: Trying to solve TLS-ALPN-01
2019/08/29 06:43:12 [INFO] [c.example] The server validated our request
2019/08/29 06:43:12 [INFO] [c.example] acme: Validations succeeded; requesting certificates
2019/08/29 06:43:13 [INFO] [c.example] Server responded with a certificate.
2019/08/29 06:43:13 [INFO] [www.d.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:14 [INFO] [www.d.example] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/[redacted]
2019/08/29 06:43:14 [INFO] [www.d.example] acme: use tls-alpn-01 solver
2019/08/29 06:43:14 [INFO] [www.d.example] acme: Trying to solve TLS-ALPN-01
2019/08/29 06:43:22 [INFO] [www.d.example] The server validated our request
2019/08/29 06:43:22 [INFO] [www.d.example] acme: Validations succeeded; requesting certificates
2019/08/29 06:43:23 [INFO] [www.d.example] Server responded with a certificate.
2019/08/29 06:43:23 [INFO] [e.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:23 [INFO] [f.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:24 [INFO] [g.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:24 [INFO] [h.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:24 [INFO] [i.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:24 [INFO] [www.j.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:25 [INFO] [j.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:25 [INFO] [k.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:25 [INFO] [l.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:25 [INFO] [m.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:26 [INFO] [n.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:26 [INFO] [o.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:26 [INFO] [p.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:27 [INFO] [q.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:27 [INFO] [r.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:27 [INFO] [s.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:27 [INFO] [www.t.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:28 [INFO] [www.u.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:28 [INFO] [www.v.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:28 [INFO] [www.w.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:28 [INFO] [shop.x.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:29 [INFO] [y.example] acme: Obtaining bundled SAN certificate
2019/08/29 06:43:29 [INFO] [www.z.example] acme: Obtaining bundled SAN certificate
[...]

After this point in time, there are no further acme challenges attempts or anything else than “Obtaining bundled SAN certificate” repeated.

After I deleted around 150 domains from /root/.caddy/acme/acme-v02.api.letsencrypt.org/sites, some certificates are now being issued again (but for other domains it keeps showing only Obtaining bundled SAN certificate).

I wonder if it’s hitting some maximum.

I checked that number of file descriptors is set to something high:

ulimit -n
131072

It resumed working by itself after around 12 hours, with no change to the config. Caddy was restarted a few times, I don’t know if that had any impact.

I think it was related to a high influx of new domains, so a workaround could be to gradually migrate domains over instead of many at once.

We are seeing it again in another situation where around 500 new domains were migrated to the Caddy server. Caddy will stop trying to issue new certificates, only showing “acme: Obtaining bundled SAN certificate” in the logs.

This happens when a lock file is placed, but something prevents it from being cleaned up. Usually a forceful process termination/restart or similar. Do you have multiple Caddy instances in a cluster configured to share the same storage? Were any processes terminated forcefully?

1 Like

It doesn’t seem to be. It’s the only caddy process, and it has been running for several weeks, and some certs do get issued. Storage is plain files on a local disk.

Can this lock be cleared manually?

Yes, if you’re sure no other instances are running and doing ACME operations, you can remove the $CADDYPATH/locks folder (default is $HOME/.caddy).

It doesn’t seem to change the behaviour. The locks folder is continuously being populated for a short while (5-10s) with files related to the domains it is showing “acme: Obtaining bundled SAN certificate” for in the logs. But it doesn’t seem to get to the other steps in the acme process.

Once in a while, a domain will get to the “AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3” step.

It could just be network latency then.

To the Lets Encrypt endpoint you mean? It seems their cloudflare node is very close to us:

ping acme-v02.api.letsencrypt.org
PING ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com (172.65.32.248) 56(84) bytes of data.
64 bytes from 172.65.32.248 (172.65.32.248): icmp_seq=1 ttl=61 time=0.355 ms

The connection is a almost non-utilized 10 Gbit.
Thanks for all your answers BTW!

Currently, Caddy locks ACME operations so that only one happens at a time. (That will go away in the future but we need to be sure to get the throttling right first.) So, it’s not surprising that your 10 Gbit line is “non-utilized”, but there is evidently some delay in creating the authz with Let’s Encrypt.

I’d recommend switching to the staging endpoint and doing some tests to see if you can figure out what is causing the delays.

A few hours later, I’m getting “acme: Validations succeeded; requesting certificates” again. So it seems it again happens only when we migrate over a larger number of domains (such as 500). I think one acme operation at a time should still be sufficient for the influx, since each only takes around 10 s, but maybe the way they are queued could be the problem?

I just posted in another thread about LetsEncrypt’s 300 new orders per 3 hours rate limit for the v2 API… But I’d have expected Caddy to show an actual ACME error in that case.

Thanks for the replies! For now we’ll try and take it easy :slight_smile:
I do wonder if renewal spikes could similarly occur, though.

If this is a rate limit issue, renewals don’t count; they aren’t new orders, and they’re individually rate limited at 5/week, so if I understand correctly you can renew all of them at once, five times in a week, without issue.

I just ran into this issue. I’m running 3 caddy instances with the redis clustering plugin.
I migrated about 50 domains where maybe 40 were subdomains of the same domain.

I then suddenly near the end get Obtaining bundled SAN certificate with no error message.

I then tried with another domain that wasn’t part of the migration, same message.
then I tried that other domain on a test system that’s setup the same way, and there it went through.

So I’m thinking it’s some kind of ip rate limit instead of a domain rate limit.

There are no locks hanging in redis, and a lock shows up when it’s trying to get one, and gets removed when it stops. Which I would expect it to do.

It’s been about 20 hours since I did the migration.

@sunebeck did you ever figure out why and how to fix it? Or is it still a waiting game?

What are your configurations? (Full and complete, without redactions, please)

Caddyfile:

:80 {
  proxyprotocol x.x.x.x/16

  redir 301 {
    if {path} not /caddyping
    https://{host}{uri}
  }

  status 200 /caddyping
}

:443 {
  proxyprotocol x.x.x.x/16

  tls {
    ask http://ask-caddy-api/query
  }

  proxy / proxyservice:80 {
    transparent
    websocket
  }
}

env:

CADDY_CLUSTERING=redis
CADDY_CLUSTERING_REDIS_TLS=true
CADDY_CLUSTERING_REDIS_PORT=xxxx
CADDY_CLUSTERING_REDIS_HOST=xxxx
CADDY_CLUSTERING_REDIS_PASSWORD=xxxx

command:
caddy -email EMAIL -agree

Perfect, thanks! (sigh, you still redacted stuff, but in this case it fortunately has little to do with TLS – usually this gets in the way of our debugging though, please don’t redact in the future!)

CertMagic currently has a lock so that only 1 cert operation happens at a time (this will go away in the future, replaced instead by a rate limiter that allows bursts of requests but still has a hard limit over a sliding window), so that lock is acquired, and then…

The log lines [*] acme: Obtaining bundled SAN certificate are emitted directly by lego, here: https://github.com/go-acme/lego/blob/bc4b57accc090b9c61bde051c99fcb14e952f6e6/certificate/certificates.go#L97

It simply seems that that function is never returning. It might be worth filing a bug report with lego because it should probably emit more logs here to help us understand what is going on.

Yeah It would help with some error logging.

I forgot to mention that I tried to restart the caddy instances as well (removing and creating new containers in docker, so completely fresh state), still the same.

I’ll file a bug report with lego.