TLS renewal fails with HTTP 400 urn:ietf:params:acme:error:malformed - JWS verification error

herkulessi · January 9, 2024, 2:43pm

1. The problem I’m having:

My monitoring recently alerted me that my certificates will expire in about 21 days. I assumed Caddy would handle it, but I checked the Caddy logs to be sure and found a bunch of errors about renewals failing with both Let’s Encrypt and ZeroSSL.

2. Error messages and/or full log output:

It basically boils down to

Jan 09 14:45:00 proxy1 caddy[944950]: {"level":"error","ts":1704807900.171735,"logger":"tls.renew","msg":"could not get certificate from issuer","identifier":"<one of my domains>","issuer":"acme-v02.api.letsencrypt.org-directory","error":"HTTP 400 urn:ietf:params:acme:error:malformed - JWS verification error"}

Pastebin with full and long but redacted log output. I don’t think the actual domains or any client IPs matter here. Captured with debug enabled.

3. Caddy version:

v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=

4. How I installed and ran Caddy:

a. System environment:

neofetch output

       _,met$$$$$gg.          root@proxy1 
    ,g$$$$$$$$$$$$$$$P.       ----------- 
  ,g$$P"     """Y$$.".        OS: Debian GNU/Linux 12 (bookworm) x86_64 
 ,$$P'              `$$$.     Host: KVM/QEMU (Standard PC (i440FX + PIIX, 1996) pc-i440fx-8.1) 
',$$P       ,ggs.     `$$b:   Kernel: 6.1.0-17-amd64 
`d$$'     ,$P"'   .    $$$    Uptime: 7 days, 7 hours, 42 mins 
 $$P      d$'     ,    $$P    Packages: 553 (dpkg) 
 $$:      $$.   -    ,d$$'    Shell: bash 5.2.15 
 $$;      Y$b._   _,d$P'      Resolution: 1280x800 
 Y$$.    `.`"Y$$$$P"'         CPU: QEMU Virtual version 2.5+ (1) @ 1.992GHz 
 `$$b      "-.__              GPU: 00:02.0 Vendor 1234 Device 1111 
  `Y$$                        Memory: 102MiB / 457MiB 
   `Y$$.
     `$$b.                                            
       `Y$$b.                                         
          `"Y$b._
              `"""

Basically a standard Debian Bookworm install. Caddy was installed via the Caddy apt repository and is run via systemd.

b. Command:

Started by systemd.

$ cat /proc/$(pidof caddy)/cmdline | xargs -0 echo
/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile

c. Service/unit/compose file:

The default one supplied by the apt package.

# caddy.service
#
# For using Caddy with a config file.
#
# Make sure the ExecStart and ExecReload commands are correct
# for your installation.
#
# See https://caddyserver.com/docs/install for instructions.
#
# WARNING: This service does not use the --resume flag, so if you
# use the API to make changes, they will be overwritten by the
# Caddyfile next time the service is restarted. If you intend to
# use Caddy's API to configure it, add the --resume flag to the
# `caddy run` command or use the caddy-api.service file instead.

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile --force
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_ADMIN CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target

d. My complete Caddy config:

No, that is not the full config, but my config contains a LOT of credentials. This shortened config contains all the TLS-relevant changes I made. All server blocks look like this:

subdomain.domain.tld {
        # A bunch of options, without any TLS options
}

{
        email <my email>
        on_demand_tls {
                ask https://<one of my domains>/<one of my domains>
        }
        servers {
                trusted_proxies static <a few subnets>
        }
        debug
}
localhost, 127.0.0.1, [::1] {
        tls internal
        respond "Hello World!"
}
https:// {
        tls internal
        respond "Hello Https!"
}
:443 {
        tls internal
        respond "Hello 443!"
}

I fail to see how the rest of the config would help diagnose this issue, considering the very same config was used to get the certificates in the first place. I have a second machine with the exact same config, but without the issues. All my domain names are handled by two separate Caddy instances.

5. Links to relevant resources:

Mohammed90 · January 9, 2024, 2:50pm

Are they both handling the same domain names? If yes, do they share the backing storage?

herkulessi · January 9, 2024, 3:23pm

Yes, they use exactly the same Config. The domains point at both servers. They do not share the backing storage, but I never had any issues with this kind of setup in the past.

Mohammed90 · January 9, 2024, 5:03pm

You were lucky. That’s all. If the 2 instances don’t share the backing storage, they cannot fulfill each other’s ACME challenges. What happens is:

1. server-1 initiates the renewal request.
2. Let’s Encrypt agrees with server-1 on the challenge.
3. server-1 presents the challenge resolution token.
4. Let’s Encrypt calls the challenge resolution path on the domain example.com.
5. DNS, by sheer bad luck, gives the IP address of server-2, because the domain resolves to both servers.
6. Let’s Encrypt calls server-2 because of the DNS result.
7. server-2 says “I don’t know what you’re talking about”, or perhaps gives a different challenge token for the renewal which it initiated on its own independently from server-1.
8. Let’s Encrypt says, “well, the challenge failed. No renewal for you!”

If you have multiple servers/hosts that serve the same domain names, they should share a common storage area so they know about each other’s actions and any of them can resolve the renewal challenge successfully. It worked before by sheer luck: the DNS resolution happened to point at the server that initiated the request.

The Correct® solution in this situation is to introduce common storage for both servers. You can use S3, an SQL database, Consul, or any of the other shared storage engines.
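To make the failure mode concrete, here is a toy model of the flow described above. It is only a sketch, not how Caddy or Let’s Encrypt are implemented: the `Server` class, the token format and the 50/50 DNS choice are all assumptions made for illustration. Real validation may involve several requests, so the real odds will differ.

```python
import random


class Server:
    """Toy web server with a challenge token store (hypothetical model)."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # dict: token -> expected challenge response

    def start_renewal(self, token):
        # The initiating server records the response it must serve.
        self.store[token] = f"{token}.keyauth"

    def answer_challenge(self, token):
        # None means "I don't know what you're talking about".
        return self.store.get(token)


def renewal_succeeds(servers, initiator, rng):
    token = f"tok-{rng.getrandbits(32):08x}"
    initiator.start_renewal(token)
    # Assumption: DNS hands the CA either server's address at random.
    responder = rng.choice(servers)
    return responder.answer_challenge(token) == f"{token}.keyauth"


def success_rate(shared, trials=10_000, seed=1):
    rng = random.Random(seed)
    common = {}
    # With shared storage both servers see the same token store;
    # without it, each server gets its own private dict.
    servers = [
        Server("server-1", common if shared else {}),
        Server("server-2", common if shared else {}),
    ]
    wins = sum(renewal_succeeds(servers, servers[0], rng) for _ in range(trials))
    return wins / trials
```

In this model `success_rate(shared=True)` is always 1.0, because whichever server is asked can look up the token. With `shared=False` it hovers around one half, and the chance of winning `k` independent renewals in a row is `0.5 ** k`, for example `0.5 ** 3 == 0.125`.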

matt · January 9, 2024, 6:24pm

Mohammed is correct.

Just note that S3 doesn’t provide atomic ops, so it’s not truly safe in high-scale environments, but for 2 instances that aren’t running their renewal routines at the exact same time you might get lucky most of the time.

herkulessi · January 9, 2024, 9:37pm

I guess I will have to set up some kind of backing storage then…
Maybe I’ll host something like MinIO, since none of my databases are reachable from the reverse proxy.

Thank you very much

(And if I did just get lucky, I got REALLY lucky, because I am pretty sure I have been running this for a few renewal cycles at this point)

francislavoie · January 9, 2024, 11:20pm

Basically just how many coin tosses you won in a row

herkulessi · January 10, 2024, 12:20pm

Could I share the filesystem folder via NFS? That way I wouldn’t need to recompile Caddy, AND I wouldn’t have to set up MinIO…

francislavoie · January 10, 2024, 12:40pm

Probably, but you might get timing issues if it’s not super fast. The problem is that NFS-type storage mechanisms don’t have atomic writes (guaranteeing that other clients reading will see the written change before continuing), which Caddy relies on for creating files as locks. But using NFS will still drastically reduce the chance of misbehaviour.
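For readers unfamiliar with files used as locks, the general idea can be sketched as follows. This is not Caddy’s code. It is a minimal illustration that assumes exclusive file creation is the locking primitive, and the names `try_acquire`, `release` and the lock path are hypothetical.

```python
import os


def try_acquire(lock_path):
    """Return True if we created the lock file, False if it already exists."""
    try:
        # O_CREAT | O_EXCL asks the filesystem to fail if the file exists,
        # so only one caller can win the creation.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for debugging
    return True


def release(lock_path):
    """Drop the lock by deleting the file."""
    os.remove(lock_path)
```

On a local filesystem the second `try_acquire` on the same path returns False until `release` is called. The concern raised above is whether a network filesystem makes that create-or-fail step visible to the other client quickly enough.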

(My personal recommendation is to use Redis for storage, as it’s the simplest option: pberkel/caddy-storage-redis on GitHub.)

herkulessi · January 10, 2024, 12:46pm

Fair point, I think I’ll try NFS first because the NFS already exists and I can keep using the apt packages… But Redis or the like is a close second.
