TLS renewal fails with HTTP 400 urn:ietf:params:acme:error:malformed - JWS verification error

1. The problem I’m having:

My monitoring recently alerted me that my certificates will expire in about 21 days. I thought Caddy would handle the renewal, but I checked the Caddy logs to be sure and found a bunch of errors about renewals failing with both Let’s Encrypt and ZeroSSL.

2. Error messages and/or full log output:

It basically boils down to

Jan 09 14:45:00 proxy1 caddy[944950]: {"level":"error","ts":1704807900.171735,"logger":"tls.renew","msg":"could not get certificate from issuer","identifier":"<one of my domains>","issuer":"acme-v02.api.letsencrypt.org-directory","error":"HTTP 400 urn:ietf:params:acme:error:malformed - JWS verification error"}

Pastebin with full and long but redacted log output. I don’t think the actual domains or any client IPs matter here. Captured with debug enabled.

3. Caddy version:

v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=

4. How I installed and ran Caddy:

a. System environment:

neofetch output
       _,met$$$$$gg.          root@proxy1 
    ,g$$$$$$$$$$$$$$$P.       ----------- 
  ,g$$P"     """Y$$.".        OS: Debian GNU/Linux 12 (bookworm) x86_64 
 ,$$P'              `$$$.     Host: KVM/QEMU (Standard PC (i440FX + PIIX, 1996) pc-i440fx-8.1) 
',$$P       ,ggs.     `$$b:   Kernel: 6.1.0-17-amd64 
`d$$'     ,$P"'   .    $$$    Uptime: 7 days, 7 hours, 42 mins 
 $$P      d$'     ,    $$P    Packages: 553 (dpkg) 
 $$:      $$.   -    ,d$$'    Shell: bash 5.2.15 
 $$;      Y$b._   _,d$P'      Resolution: 1280x800 
 Y$$.    `.`"Y$$$$P"'         CPU: QEMU Virtual version 2.5+ (1) @ 1.992GHz 
 `$$b      "-.__              GPU: 00:02.0 Vendor 1234 Device 1111 
  `Y$$                        Memory: 102MiB / 457MiB 
   `Y$$.
     `$$b.                                            
       `Y$$b.                                         
          `"Y$b._
              `"""

Basically a standard Debian Bookworm install. Caddy was installed from the Caddy apt repository and is run via systemd.

b. Command:

Started by systemd.

$ cat /proc/$(pidof caddy)/cmdline | xargs -0 echo
/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile

c. Service/unit/compose file:

The default one supplied by the apt package.
# caddy.service
#
# For using Caddy with a config file.
#
# Make sure the ExecStart and ExecReload commands are correct
# for your installation.
#
# See https://caddyserver.com/docs/install for instructions.
#
# WARNING: This service does not use the --resume flag, so if you
# use the API to make changes, they will be overwritten by the
# Caddyfile next time the service is restarted. If you intend to
# use Caddy's API to configure it, add the --resume flag to the
# `caddy run` command or use the caddy-api.service file instead.

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile --force
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_ADMIN CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target

d. My complete Caddy config:

No, this is not the full config, because my config contains a LOT of credentials. This shortened config contains all the TLS-relevant changes I made. All server blocks look like this:

subdomain.domain.tld {
        # A bunch of options, without any TLS options
}
{
        email <my email>
        on_demand_tls {
                ask https://<one of my domains>/<one of my domains>
        }
        servers {
                trusted_proxies static <a few subnets>
        }
        debug
}
localhost, 127.0.0.1, [::1] {
        tls internal
        respond "Hello World!"
}
https:// {
        tls internal
        respond "Hello Https!"
}
:443 {
        tls internal
        respond "Hello 443!"
}

I fail to see how the rest of the config would help diagnose this issue, considering the very same config was used to get the certificates in the first place. I have a second machine with the exact same config, but without the issues. All my domain names are handled by two separate Caddy instances.

5. Links to relevant resources:

Are they both handling the same domain names? If yes, do they share the backing storage?


Yes, they use exactly the same config. The domains point at both servers. They do not share the backing storage, but I never had any issues with this kind of setup in the past.

You were lucky. That’s all. If the 2 instances don’t share the backing storage, they cannot fulfill each other’s ACME challenges. What happens is:

  • server-1 initiates the renewal request
  • Let’s Encrypt agrees with server-1 on the challenge
  • server-1 presents the challenge resolution token
  • Let’s Encrypt calls the challenge resolution path on the domain example.com
  • DNS, by sheer bad luck, gives the IP address of server-2, because the domain resolves to both servers
  • Let’s Encrypt calls server-2 because of the DNS result
  • server-2 says “I don’t know what you’re talking about”, or perhaps gives a different challenge token for the renewal which it initiated on its own independently from server-1
  • Let’s Encrypt says, “well, the challenge failed. No renewal for you!”

If you have multiple servers/hosts that serve the same domain names, they should share a common storage area so they know about each other’s actions and any of them is able to resolve the renewal challenge successfully. It worked before by sheer luck: the DNS resolution happened to point at the server that initiated the request.

The Correct® solution in this situation is to introduce common storage for both servers. You can use S3, an SQL database, Consul, or any of the other shared storage engines.


Mohammed is correct. :100:

Just note that S3 doesn’t provide atomic ops, so it’s not truly safe in high-scale environments, but for 2 instances that aren’t running their renewal routines at the exact same time you might get lucky most of the time.

I guess I will have to set up some kind of backing storage then…
Maybe I’ll host something like MinIO, since none of my databases are reachable from the reverse proxy.

Thank you very much

(And if I did just get lucky, I got REALLY lucky, because I am pretty sure I have been running this for a few renewal cycles at this point.)

Basically just how many coin tosses you won in a row :rofl:

Could I share the filesystem folder via NFS? That way I wouldn’t need to recompile Caddy, AND I wouldn’t have to set up MinIO…

Probably, but you might get timing issues if it’s not super fast. The problem is NFS type storage mechanisms don’t have atomic writes (guaranteeing that other clients reading will see the written change before continuing) which Caddy relies on for creating files as locks. But it’ll still drastically reduce the chance of misbehaviour by using NFS.
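For reference, a minimal sketch of what the NFS approach could look like in the Caddyfile, using the built-in file_system storage module so the stock apt package keeps working. The mount point /mnt/caddy-shared is a made-up path for illustration; it is assumed to be the same NFS export mounted on both servers and writable by the caddy user. The storage line would go into the existing global options block on both machines:

```caddyfile
{
	# Assumption: /mnt/caddy-shared is a hypothetical NFS mount that exists
	# on both servers and is writable by the caddy user.
	storage file_system /mnt/caddy-shared

	email <my email>
}
```

With both instances pointing at the same directory, whichever server receives the validation request should be able to find the challenge data written by the server that started the renewal, subject to the NFS timing caveats mentioned above.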

(My personal recommendation is to use Redis for storage, simplest option) GitHub - pberkel/caddy-storage-redis
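As a rough sketch only, the Redis variant might look like the fragment below. It assumes a custom Caddy build that includes the pberkel/caddy-storage-redis module (for example built with xcaddy), so it does not work with the plain apt package. The host name redis.internal is hypothetical, and the host and port option names should be checked against the module’s README for the version in use:

```caddyfile
{
	# Assumption: Caddy was built with github.com/pberkel/caddy-storage-redis.
	# redis.internal is a hypothetical Redis host reachable from both servers.
	storage redis {
		host redis.internal
		port 6379
	}
}
```

As with the NFS option, both Caddy instances would need the same storage block so they read and write the same certificate and challenge data.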


Fair point. I think I’ll try NFS first, because the NFS share already exists and I can keep using the apt packages… But Redis or the like is a close second.
