Timeout waiting for record to fully propagate

1. The problem I’m having:

Depending on which LAN I run my Caddyfile (below) from, caddy/acme will time out at the DNS TXT record verification step.

On the LAN that misbehaves, the _acme-challenge record is successfully written to my Route 53 zone (I can see it in the AWS console), but acme/caddy is then unable to confirm it with a lookup. I even tried setting a custom “resolver” of 8.8.8.8, and that did not help.

I don’t see this as a Caddy issue, but maybe the collective experience here can help me. There is nothing wrong with my Caddy container or the Caddyfile (it runs fine on another machine/LAN); rather, something is amiss with how the DNS TXT record is verified after being written when running from this particular machine/LAN/gateway-router.

If someone could explain how the lookup step is done, I might be able to track down why this is happening. Is that lookup done from the ACME servers, or from my Caddy instance directly out of my LAN? If I dig the TXT record from the machine/LAN in question, it comes back with the correct value, so why does this work from the command line but not from caddy/acme?

; <<>> DiG 9.16.1-Ubuntu <<>> TXT _acme-challenge.admin.sj111.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20483
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;_acme-challenge.admin.sj111.net. IN	TXT

;; ANSWER SECTION:
_acme-challenge.admin.sj111.net. 0 IN	TXT	"NuCBJY1ncyDidUAWgNQEz4c21rDrY9e7H74Y8D4FnFs"

;; Query time: 59 msec
;; SERVER: 192.168.8.1#53(192.168.8.1)
;; WHEN: Sun Feb 11 09:45:46 PST 2024
;; MSG SIZE  rcvd: 116

If I look it up from outside my LAN with an online TXT record lookup tool, no problem: it’s “there” almost immediately.
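For reference, here is how the challenge record name is derived for a wildcard identifier, plus a one-liner to ask a specific resolver for it directly (the domain and resolver here are just the ones from this thread; substitute your own):

```shell
# Build the ACME DNS-01 challenge record name for a domain.
# For a wildcard cert, the leading "*." is stripped first.
challenge_name() {
    printf '_acme-challenge.%s\n' "${1#\*.}"
}

challenge_name '*.admin.sj111.net'   # -> _acme-challenge.admin.sj111.net

# Then query a specific resolver directly, bypassing the LAN gateway:
# dig +short TXT "$(challenge_name '*.admin.sj111.net')" @8.8.8.8
```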

2. Error messages and/or full log output:

{"level":"error","ts":1707412747.1206288,"logger":"tls.obtain","msg":"could not get certificate from issuer" ,"identifier":"*.admin.sj111.net","issuer":"acme-staging-v02.api.letsencrypt.org-directory","error":"[*.admin.sj111.net] solving challenges: waiting for solver certmagic.solverWrapper to be ready: timed out waiting for record to fully propagate; verify DNS provider configuration is correct - last error: <nil> (order=https://acme-staging-v02.api.letsencrypt.org/acme/order/135555693/14340715083) (ca=https://acme-staging-v02.api.letsencrypt.org/directory)"}

3. Caddy version:

v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=

4. How I installed and ran Caddy:

I run Caddy in a Docker container from a custom image I build.
The image grabs the latest release rather than installing via Alpine packages.

a. System environment:

latest alpine 3.19

b. Command:

/opt/caddy/bin/caddy run --config test.conf --adapter caddyfile

BTW, that container is running on an Ubuntu Focal host on an RPi 4.

services:
  caddy:
    container_name: ${NAME:-caddy}
    image: ${IMAGE:-caddy}
    # if no $CONF is given then Caddyfile in ${PWD}/conf:/opt/caddy/conf will be used
    command: caddy run ${CONF}
    hostname: ${NAME:-caddy}
    env_file:
      - $CREDENTIALS
    volumes:
      - data:/opt/caddy/data
      - settings:/opt/caddy/settings
      - conf:/opt/caddy/conf
      # - files:/opt/caddy/files
    restart: unless-stopped
    ports:
      - 80:80
      - 443:443
      - 2019:2019
# Binding data and settings is not required,
# but if these volumes are deleted Caddy will need to redo all the certs
volumes:
  data:
  # driver_opts:
  #   type: none
  #   device: ${PWD}/data
  #   o: bind
  settings:
    # driver_opts:
    #   type: none
    #   device: ${PWD}/config
    #   o: bind
  # files:
  #   driver_opts:
  #     type: none
  #     device: /data/Hacking/webfiles
  #     o: bind
  conf:
    driver_opts:
      type: none
      device: ${PWD}/conf
      o: bind

d. My complete Caddy config:

{
    acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}

*.sj111.net, *.admin.sj111.net, *.dashboard.sj111.net {
	tls sj111.net@gmail.com {
		resolvers 8.8.8.8
		dns route53 {
			max_retries 10
		}
	}

	@docker host docker.sj111.net 
	handle @docker {
		reverse_proxy admin.111.net:9005
	}

}

The ACME issuer does its own DNS checks, but only after Caddy says “okay go ahead”.

And Caddy only says “go ahead” after Caddy itself can confirm that “yeah I can see that the DNS record is there”. This is called the “propagation check”.

The propagation check is optional; it’s not strictly necessary. Caddy just does it as a sanity check. But you can turn it off, and it’ll move on and let the ACME issuer do its thing. Add propagation_timeout -1 to your tls config to turn off the checks.

The idea is to avoid putting pressure on ACME issuers by only telling them to go when the record is ready, so as not to waste their time/resources. But it can sometimes fail if, from Caddy’s perspective, it can’t see the TXT records.
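In pseudocode terms, the propagation check is just a poll-until-visible loop. This is NOT Caddy’s actual code, only a sketch of the idea; lookup_txt is stubbed here so it runs standalone (in reality it would be a real DNS query, e.g. `dig +short TXT "$1" @8.8.8.8`):

```shell
# Stub lookup so the sketch is self-contained; a real check queries DNS.
lookup_txt() {
    echo "NuCBJY1ncyDidUAWgNQEz4c21rDrY9e7H74Y8D4FnFs"
}

# Poll until the TXT record shows the expected value, or give up after N tries.
wait_for_txt() {
    name=$1; expected=$2; tries=$3
    i=0
    while [ "$i" -lt "$tries" ]; do
        [ "$(lookup_txt "$name")" = "$expected" ] && return 0
        # sleep 2   # a real check would wait between attempts
        i=$((i + 1))
    done
    return 1   # -> "timed out waiting for record to fully propagate"
}

if wait_for_txt _acme-challenge.admin.sj111.net NuCBJY1ncyDidUAWgNQEz4c21rDrY9e7H74Y8D4FnFs 5; then
    echo "record visible; OK to tell the CA to validate"
else
    echo "propagation check timed out"
fi
```

Only after that loop succeeds does Caddy tell the CA to go ahead and do its own validation lookup.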

I don’t know why it’s not working with 8.8.8.8, in theory it should work.

@francislavoie based on what you said, I checked whether I can look up the challenge record from inside the container running Caddy, and I can. So there goes that possible explanation.

Further, I don’t understand why, but if I do set propagation_timeout -1, Caddy never writes the challenge TXT entry to the Route 53 zone, so your suggestion fails. Is there a way to set the propagation timeout so that after the timeout ACME will continue? As it is, it just throws an error and stops the process.


Anyway, thanks for your input. I wish I could discover the difference between this machine/LAN and the one where this works. I currently don’t have another machine on the LAN in question where I can remote in and run Docker and my container, so I’ll have to make a trip there and try on a laptop to determine whether the issue is:

  • host os
  • version/setup of docker on that host
  • LAN gateway router

That makes no sense. The TXT record is written before propagation checks happen. :thinking:

I agree, but here are the errors when that is set to -1. If I don’t use it, those errors don’t appear, the record is written, and I’m back to the original propagation timeout error.

*.sj312.net {
	tls sj111.net@gmail.com {
		propagation_timeout -1
		dns route53 {
			max_retries 10
		}
	}
}
111-caddy | {"level":"info","ts":1707761221.745903,"logger":"http.acme_client","msg":"trying to solve challenge","identifier":"*.sj111.net","challenge_type":"dns-01","ca":"https://acme-staging-v02.api.letsencrypt.org/directory"}
111-caddy | {"level":"error","ts":1707761223.2379522,"logger":"http.acme_client","msg":"challenge failed","identifier":"*.sj111.net","challenge_type":"dns-01","problem":{"type":"urn:ietf:params:acme:error:unauthorized","title":"","detail":"No TXT record found at _acme-challenge.sj111.net","instance":"","subproblems":[]}}
111-caddy | {"level":"error","ts":1707761223.2381208,"logger":"http.acme_client","msg":"validating authorization","identifier":"*.sj111.net","problem":{"type":"urn:ietf:params:acme:error:unauthorized","title":"","detail":"No TXT record found at _acme-challenge.sj111.net","instance":"","subproblems":[]},"order":"https://acme-staging-v02.api.letsencrypt.org/acme/order/135555693/14461997903","attempt":1,"max_attempts":3}

At this point I can’t believe this is anything but some issue with the host OS/Docker setup. That host OS/Docker install is a couple of years old; I need to get it updated before wasting any more time on this.

Okay, in that case you can also add propagation_delay 30s or something like that (not sure how long Route 53 needs) to have Caddy wait a fixed amount of time instead of making DNS requests. Use both: propagation_timeout -1 to turn off the DNS check, and the delay to push the order forward in time.
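Putting those two options together, the tls block would look something like this (the 30s delay is a guess; adjust for however long Route 53 actually takes, and this is based on the config already posted in this thread):

```
*.sj111.net {
	tls sj111.net@gmail.com {
		propagation_timeout -1
		propagation_delay 30s
		dns route53 {
			max_retries 10
		}
	}
}
```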