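# Caddy version used for both the builder and the runtime image
ARG CADDY_VERSION=2.5.2
FROM caddy:${CADDY_VERSION}-builder AS builder
# Build Caddy at a specific commit, with the docker-proxy and Cloudflare DNS plugins
RUN xcaddy build c7772588bd44ceffcc0ba4817e4d43c826675379 \
    --with github.com/lucaslorentz/caddy-docker-proxy/plugin \
    --with github.com/caddy-dns/cloudflare
FROM caddy:${CADDY_VERSION}-alpine
# Replace the stock binary with the custom build
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
CMD ["caddy", "docker-proxy"]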
Launched the new image. Launched whoami. Same issue: it takes a while but finally resolves. Ditto on wikijs. I copied wikijs to a new config and launched it with a subdomain I had not used before, and it seems to behave the same.
For each of the three containers/subdomains launched, I observed that it created two DNS entries (_acme-challenge). It eventually deletes the first one and leaves the second. It creates two certs for each subdomain according to:
Darn. So, this only happens when the config changes, right? If you just set up the one (whoami) and leave it running, does it work successfully? (Starting from a clean slate, i.e. no lingering DNS records, etc.)
If so, I might see about patching Caddy/CertMagic today so that it will always clean up even if the context is cancelled.
Good news: it is cleaning up the DNS entries perfectly. I watched very closely and there were never multiple challenge entries. And they were cleaned up after the cert was issued… when a cert was issued…
Launched whoami first. It took about 6-7 mins and about 3 iterations of challenges, counting each new DNS challenge entry created and removed. It was issued by ZeroSSL. After many runs, this is only the 3rd time one was issued by ZeroSSL instead of LE.
Launched wiki1. WikiJS container renamed to a subdomain I haven’t used before. Same as above. About 6-7 mins, about 3 iterations. Issued by ZeroSSL (???) I don’t have a paid account there (yet) and I thought it was limited to 4? But dashboard shows 4 certs (wiki, catz, whoami, wiki1).
Launched kitties (mikesir87/cats) container. Haven’t used that subdomain before. After 30 mins, 9+ iterations, it still has not issued a cert, still trying.
Why did none of these succeed on LE, where most of the others had before? I see now on the certificate search, where it shows two for each subdomain, that one is a pre-certificate and one a leaf certificate, which answers my question about why two certs are issued per subdomain. But why are certs not reacquired when one has previously been issued for a subdomain? Why me? lol. Only slightly kidding there. I am wondering why I am facing this problem. Am I doing something odd or non-standard? Surely others are running a similar configuration. What am I doing differently that caused this breakage?
… I was interrupted while composing this. It’s been well over an hour waiting for a cert on that last subdomain now and it still hasn’t been issued, though I haven’t updated the linked logs above since nothing else has changed.
Yeah, their website is very misleading, but no, you have unlimited free certs with ZeroSSL, when issued via ACME.
Wow, I really don’t understand why the config is being reloaded so often. I’d honestly call that a bug in the swarm provider in CDP, probably.
I feel like it shouldn’t add the site to the config unless there’s a container that’s actually up, cause it causes the config to get loaded, then immediately get reloaded which cancels the initial config’s, etc.
But I can see the argument for it both ways, cause like maybe there is some config that should be provisioned even if there is no container running yet, and you might want to have like a handle_errors block which does whatever else when you don’t have a container available to handle it. I dunno. Bah.
Maybe CDP should like… debounce config updates to potentially group them up so it doesn’t try to reload a ton of times? Hmmmm
{"level":"error","ts":1660846756.82023,"logger":"tls.obtain","msg":"could not get certificate from issuer","identifier":"kitties.mysmarthome.network","issuer":"acme-v02.api.letsencrypt.org-directory","error":"[kitties.mysmarthome.network] solving challenges: waiting for solver certmagic.solverWrapper to be ready: timed out waiting for record to fully propagate; verify DNS provider configuration is correct - last error: <nil> (order=https://acme-v02.api.letsencrypt.org/acme/order/684992797/117414919027) (ca=https://acme-v02.api.letsencrypt.org/directory)"}
Anyways, seems like Caddy timed out on the propagation checks. This can happen if Caddy itself isn’t able to resolve the DNS properly to find out if the TXT record was successfully added by the plugin, before telling the ACME issuer “okay, it should be good now, please continue”. We do have an option to turn this off, but unfortunately it can’t be configured globally, it must be configured with the tls directive (in each site), and it must be per-issuer, so it ends up looking like this:
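A minimal sketch of that per-site, per-issuer setting, assuming Caddy 2.5.x with the Cloudflare DNS plugin from the image above. The site address, the upstream, and the `CF_API_TOKEN` environment variable are placeholders, not values from this thread. In the Caddyfile, setting `propagation_timeout` to `-1` disables the propagation check; with caddy-docker-proxy the same structure would have to be expressed as container labels rather than written directly.

```
whoami.example.com {
	tls {
		# Let's Encrypt (ACME) issuer with propagation checks disabled
		issuer acme {
			dns cloudflare {env.CF_API_TOKEN}
			propagation_timeout -1
		}
		# ZeroSSL issuer, configured the same way
		issuer zerossl {
			dns cloudflare {env.CF_API_TOKEN}
			propagation_timeout -1
		}
	}
	reverse_proxy whoami:80
}
```

Because the option lives inside each `issuer` block, it has to be repeated for every issuer and in every site that uses the DNS challenge.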
I’ve launched a few containers with this config and so far they have all acquired certs and come up immediately. More testing is needed to confirm, but it seems to be a DNS issue. I wonder if it’s due to my network config? DNS points to my Mikrotik router, which is configured to use Cloudflare’s safe 1.1.1.3 over DoH. I will experiment with that to see if it changes anything, after more testing to confirm that all is working now.
EDIT: One thing I don’t understand though is why it seemed to work in my previous testing that didn’t use caddy-docker-proxy. Maybe luck. Or is it related to the plug-in?
Docker has its own DNS resolver layer which is what resolves container names to container IPs. That might not be playing nice in this case, preventing Caddy from seeing changes to the TXT records.
I don’t have a great grasp of what exactly Docker is causing to happen with DNS queries. It’s been a mystery to me.
If I had my way, we’d remove propagation checks altogether… I don’t think they’re useful at all. They don’t actually do anything; they just delay the cert issuance process until Caddy “knows” that the DNS is correct so that the ACME issuer doesn’t waste time polling for the DNS challenge. But in general, I don’t think that’s necessary. Some other DNS providers are known to take way too long to propagate changes done via their API (iirc GoDaddy was problematic) but most are quite rapid. So IMO the delay should be opt-in instead of opt-out. But we’ll see. Hopefully I can convince @matt sooner rather than later.
I ran a few more tests today. Not extensive enough to be 100% sure, but with either of the changes below in place the certs were acquired nearly immediately, and when I backed the changes out it went back to spinning its wheels. Here are the things that did work:
Setting Docker’s DNS option in /etc/docker/daemon.json: "dns": ["1.1.1.1"]
Setting the DHCP server on my router so that my Docker host’s DNS is set to 1.1.1.1 (I presume manually setting /etc/resolv.conf would also work, but mine is set by DHCP and I didn’t see a need to test that.)
Any of those work. Having my router’s DHCP point my Docker host back to the router and then the router going straight to 1.1.1.1, without the DoH, etc. doesn’t work.
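An untested alternative to changing the host or Docker DNS settings, offered only as a sketch: the Caddyfile `tls` directive has a `resolvers` option that sets which DNS servers Caddy queries while performing the DNS challenge. It was not tried in this thread, so whether it avoids the propagation timeout in this setup is an assumption. The site address, the upstream, and the `CF_API_TOKEN` environment variable are placeholders.

```
whoami.example.com {
	tls {
		# Use the Cloudflare DNS plugin for the DNS challenge
		dns cloudflare {env.CF_API_TOKEN}
		# Query this resolver directly instead of the system resolver
		resolvers 1.1.1.1
	}
	reverse_proxy whoami:80
}
```

If it works, this keeps the propagation check enabled and only changes where Caddy looks up the TXT record, so the router and Docker DNS settings could stay as they were.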
I guess this is good enough for me to consider closed on my end and move on to the next challenge (probably Cloudflared tunnel). If there are any tests that I didn’t think to run, let me know. Appreciate the help more than I can say. I’ve toyed around with a few programs for reverse proxy and I’m feeling extremely confident this is the one I’m sticking with.