Hard time getting a response on a DNS-01 challenge

1. Caddy version (caddy version):

v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

Built using xcaddy

.\xcaddy.exe build --with github.com/caddy-dns/glesys

2. How I run Caddy:

Running from the command line using a Caddyfile.

a. System environment:

Windows 10 x64
Running from a PowerShell terminal

b. Command:

.\caddy.exe run -watch

c. Service/unit/compose file:

d. My complete Caddyfile or JSON config:


(glesys) {
	tls {
		issuer acme {
			email "<secret>"
			propagation_timeout "20m"
			resolvers "ns1.namesystem.se"
			dns glesys {
				project "<secret>"
				api_key "<secret>"
			}
		}
	}
}

*.kapi.kmpm.dev {
	import glesys
	handle {
		respond "Site not served from here"
	}
}

3. The problem I'm having:

The DNS lookup that Caddy performs before requesting the certificate takes a very long time to succeed, and it only works after several retries. Eventually it goes through and a certificate is created.

Why can't Caddy get a response when Resolve-DnsName (or dig) can? I have seen the same with 2 different DNS providers but have no explanation as to why.

4. Error messages and/or full log output:

This is not a full log output, but it's the relevant parts.

2022/04/22 08:29:46.913 INFO    tls.issuance.acme.acme_client   trying to solve challenge       {"identifier": "*.kapi.kmpm.dev", "challenge_type": "dns-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
2022/04/22 08:49:49.126 ERROR   tls.obtain      could not get certificate from issuer   {"identifier": "*.kapi.kmpm.dev", "issuer": "acme-v02.api.letsencrypt.org-directory", "error": "[*.kapi.kmpm.dev] solving challenges: waiting for solver certmagic.solverWrapper to be ready: timed out waiting for record to fully propagate; verify DNS provider configuration is correct - last error: <nil> (order=https://acme-v02.api.letsencrypt.org/acme/order/507916187/82291298347) (ca=https://acme-v02.api.letsencrypt.org/directory)"}
2022/04/22 08:49:49.127 INFO    tls.obtain      releasing lock  {"identifier": "*.kapi.kmpm.dev"}
2022/04/22 08:49:49.129 ERROR   tls     job failed      {"error": "*.kapi.kmpm.dev: obtaining certificate: [*.kapi.kmpm.dev] Obtain: [*.kapi.kmpm.dev] solving challenges: waiting for solver certmagic.solverWrapper to be ready: timed out waiting for record to fully propagate; verify DNS provider configuration is correct - last error: <nil> (order=https://acme-v02.api.letsencrypt.org/acme/order/507916187/82291298347) (ca=https://acme-v02.api.letsencrypt.org/directory)"}

5. What I already tried:

  • I have verified that the required _acme-challenge records are created using the provider's API.
  • I have checked that I can get the challenge record on the same computer as Caddy (using Resolve-DnsName, and dig in WSL).
  • I have watched the DNS traffic using Wireshark and can see that Caddy sends a large number of queries that all get a "No such name" response, while Resolve-DnsName works during the same period.

When I use the built-in Resolve-DnsName in PowerShell I get a proper response from the server, but Caddy doesn't.
I have looked at it using Wireshark and have a dump of a successful response to Resolve-DnsName followed by a failed one to Caddy a short while later. One difference is that an OPT record is set on the Caddy query, but there might be something else that matters as well.

The following is the output from wireshark from a query-response using Resolve-DnsName as well as one triggered by caddy.

Request from using Resolve-DnsName -Server ns1.namesystem.se -Type txt _acme-challenge.kapi.kmpm.dev

Frame 1: 89 bytes on wire (712 bits), 89 bytes captured (712 bits) on interface \Device\NPF_{51E5A288-6AE3-4E1E-97AE-CB93A4EAEA02}, id 0
Ethernet II, Src: HewlettP_2f:5b:5d (ec:b1:d7:2f:5b:5d), Dst: Ubiquiti_cd:18:59 (b4:fb:e4:cd:18:59)
Internet Protocol Version 4, Src: 172.16.10.23, Dst: 195.238.76.18
User Datagram Protocol, Src Port: 25515, Dst Port: 53
Domain Name System (query)
    Transaction ID: 0x50c3
    Flags: 0x0100 Standard query
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 0
    Queries
        _acme-challenge.kapi.kmpm.dev: type TXT, class IN
            Name: _acme-challenge.kapi.kmpm.dev
            [Name Length: 29]
            [Label Count: 4]
            Type: TXT (Text strings) (16)
            Class: IN (0x0001)
    [Response In: 2] 

Reply to Resolve-DnsName

Frame 2: 145 bytes on wire (1160 bits), 145 bytes captured (1160 bits) on interface \Device\NPF_{51E5A288-6AE3-4E1E-97AE-CB93A4EAEA02}, id 0
Ethernet II, Src: Ubiquiti_cd:18:59 (b4:fb:e4:cd:18:59), Dst: HewlettP_2f:5b:5d (ec:b1:d7:2f:5b:5d)
Internet Protocol Version 4, Src: 195.238.76.18, Dst: 172.16.10.23
User Datagram Protocol, Src Port: 53, Dst Port: 25515
Domain Name System (response)
    Transaction ID: 0x50c3
    Flags: 0x8500 Standard query response, No error
    Questions: 1
    Answer RRs: 1
    Authority RRs: 0
    Additional RRs: 0
    Queries
        _acme-challenge.kapi.kmpm.dev: type TXT, class IN
            Name: _acme-challenge.kapi.kmpm.dev
            [Name Length: 29]
            [Label Count: 4]
            Type: TXT (Text strings) (16)
            Class: IN (0x0001)
    Answers
        _acme-challenge.kapi.kmpm.dev: type TXT, class IN
            Name: _acme-challenge.kapi.kmpm.dev
            Type: TXT (Text strings) (16)
            Class: IN (0x0001)
            Time to live: 3600 (1 hour)
            Data length: 44
            TXT Length: 43
            TXT: QN0yBxJUfcKlhCno_wY_ZOCSTklHcNk8OwGCGUlz6ic
    [Request In: 1]
    [Time: 0.008665000 seconds]

Caddy Request

Frame 3: 100 bytes on wire (800 bits), 100 bytes captured (800 bits) on interface \Device\NPF_{51E5A288-6AE3-4E1E-97AE-CB93A4EAEA02}, id 0
Ethernet II, Src: HewlettP_2f:5b:5d (ec:b1:d7:2f:5b:5d), Dst: Ubiquiti_cd:18:59 (b4:fb:e4:cd:18:59)
Internet Protocol Version 4, Src: 172.16.10.23, Dst: 195.238.76.18
User Datagram Protocol, Src Port: 25516, Dst Port: 53
Domain Name System (query)
    Transaction ID: 0x7edb
    Flags: 0x0100 Standard query
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 1
    Queries
        _acme-challenge.kapi.kmpm.dev: type TXT, class IN
            Name: _acme-challenge.kapi.kmpm.dev
            [Name Length: 29]
            [Label Count: 4]
            Type: TXT (Text strings) (16)
            Class: IN (0x0001)
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41)
            UDP payload size: 4096
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 0
    [Response In: 4]

Response to Caddy

Frame 4: 169 bytes on wire (1352 bits), 169 bytes captured (1352 bits) on interface \Device\NPF_{51E5A288-6AE3-4E1E-97AE-CB93A4EAEA02}, id 0
Ethernet II, Src: Ubiquiti_cd:18:59 (b4:fb:e4:cd:18:59), Dst: HewlettP_2f:5b:5d (ec:b1:d7:2f:5b:5d)
Internet Protocol Version 4, Src: 195.238.76.18, Dst: 172.16.10.23
User Datagram Protocol, Src Port: 53, Dst Port: 25516
Domain Name System (response)
    Transaction ID: 0x7edb
    Flags: 0x8503 Standard query response, No such name
    Questions: 1
    Answer RRs: 0
    Authority RRs: 1
    Additional RRs: 1
    Queries
        _acme-challenge.kapi.kmpm.dev: type TXT, class IN
            Name: _acme-challenge.kapi.kmpm.dev
            [Name Length: 29]
            [Label Count: 4]
            Type: TXT (Text strings) (16)
            Class: IN (0x0001)
    Authoritative nameservers
        kapi.kmpm.dev: type SOA, class IN, mname ns1.namesystem.se
            Name: kapi.kmpm.dev
            Type: SOA (Start Of a zone of Authority) (6)
            Class: IN (0x0001)
            Time to live: 3600 (1 hour)
            Data length: 57
            Primary name server: ns1.namesystem.se
            Responsible authority's mailbox: registry.glesys.se
            Serial Number: 35
            Refresh Interval: 10800 (3 hours)
            Retry Interval: 2700 (45 minutes)
            Expire limit: 1814400 (21 days)
            Minimum TTL: 10800 (3 hours)
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41)
            UDP payload size: 1232
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 0
    [Request In: 3]
    [Time: 0.006742000 seconds]

I do have a pcapng file from Wireshark if necessary.
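
For anyone comparing the two replies in the capture, the flags word can be decoded by hand. The following is a minimal sketch, with type and function names made up for illustration (they are not taken from Caddy or certmagic), that unpacks the two values shown by Wireshark, 0x8500 and 0x8503.

```go
package main

import "fmt"

// dnsFlags holds the DNS header bits that are relevant when comparing replies.
// The type is hypothetical and exists only for this illustration.
type dnsFlags struct {
	Response         bool   // QR bit: message is a response
	Authoritative    bool   // AA bit: answer comes from an authoritative server
	Truncated        bool   // TC bit: reply did not fit and was cut short
	RecursionDesired bool   // RD bit: copied from the query
	Rcode            uint16 // lower 4 bits: response code
}

// decodeFlags splits the 16-bit flags word of a DNS header into its fields.
func decodeFlags(raw uint16) dnsFlags {
	return dnsFlags{
		Response:         raw&0x8000 != 0,
		Authoritative:    raw&0x0400 != 0,
		Truncated:        raw&0x0200 != 0,
		RecursionDesired: raw&0x0100 != 0,
		Rcode:            raw & 0x000F,
	}
}

// rcodeName names the two response codes that appear in the capture.
func rcodeName(rcode uint16) string {
	switch rcode {
	case 0:
		return "NOERROR"
	case 3:
		return "NXDOMAIN"
	default:
		return fmt.Sprintf("RCODE %d", rcode)
	}
}

func main() {
	// 0x8500 is the reply to Resolve-DnsName, 0x8503 the reply to Caddy.
	for _, raw := range []uint16{0x8500, 0x8503} {
		f := decodeFlags(raw)
		fmt.Printf("0x%04x: authoritative=%t truncated=%t rcode=%s\n",
			raw, f.Authoritative, f.Truncated, rcodeName(f.Rcode))
	}
}
```

Both replies decode as authoritative and not truncated; the only header difference is the response code, NOERROR versus NXDOMAIN.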

6. Links to relevant resources:

Seems like people keep running into issues with certmagic's DNS propagation checks :weary: We'll add a way to turn them off soon, because they often seem to be just a speed bump when issuance would've worked without them.

The spot where it happens is in here, if it helps you figure out the debugging:


Thanks,
I will have a look during the weekend.


I made a really small DNS application that works...
What I did was comment out m.SetEdns0(4096, false) in func createDNSMsg.
With that line removed the query returns the TXT record; with the SetEdns0 statement left in, it fails.
I don't know yet why, or how important it is; I have some reading up to do.

main.go

package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/miekg/dns"
)

var dnsTimeout = 10 * time.Second

func dnsQuery(fqdn string, rtype uint16, nameservers []string, recursive bool) (*dns.Msg, error) {
	m := createDNSMsg(fqdn, rtype, recursive)
	var in *dns.Msg
	var err error
	for _, ns := range nameservers {
		in, err = sendDNSQuery(m, ns)
		if err == nil && len(in.Answer) > 0 {
			break
		}
	}
	return in, err
}

func createDNSMsg(fqdn string, rtype uint16, recursive bool) *dns.Msg {
	m := new(dns.Msg)
	m.SetQuestion(fqdn, rtype)
	//m.SetEdns0(4096, false)
	if !recursive {
		m.RecursionDesired = false
	}
	return m
}

func sendDNSQuery(m *dns.Msg, ns string) (*dns.Msg, error) {
	udp := &dns.Client{Net: "udp", Timeout: dnsTimeout}
	in, _, err := udp.Exchange(m, ns)
	// two kinds of errors we can handle by retrying with TCP:
	// truncation and timeout; see https://github.com/caddyserver/caddy/issues/3639
	truncated := in != nil && in.Truncated
	timeoutErr := err != nil && strings.Contains(err.Error(), "timeout")
	if truncated || timeoutErr {
		tcp := &dns.Client{Net: "tcp", Timeout: dnsTimeout}
		in, _, err = tcp.Exchange(m, ns)
	}
	return in, err
}

func main() {

	m, err := dnsQuery("_acme-challenge.demo.kapi.kmpm.dev.", dns.TypeTXT, []string{"ns1.namesystem.se:53"}, false)
	fmt.Printf("err: %v,\nm: %+v\n", err, m)
}


When I removed the SetEdns0 statement in certmagic (a custom, hacked build) I got the certificate in seconds because the propagation check just worked.

But SetEdns0 will be needed if the TXT data is "big", so perhaps some kind of check using both methods?


Huh, interesting. I have to admit this is going over my head; I don't have a deep enough understanding of DNS to follow what's going on there.

We did introduce a change as I linked above to make it possible to turn off propagation checks, so that should probably be a usable workaround.


That line of code tells the DNS resolver/server that the client can accept larger DNS answers of up to 4096 bytes instead of the legacy maximum of 512 bytes. However, some firewalls hate it when DNS responses are larger than 512 bytes, so they reject the message.

I'm not sure it's a sane default for us to remove it, given that we don't know the size of the message and had better err on the safe side. The recommended solution is to check your firewall configuration to allow DNS responses larger than 512 bytes. Note that the workaround listed on the linked page is for Windows DNS servers, which I don't believe you're using, or at least you haven't mentioned it explicitly.
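
To make concrete what that line adds on the wire, here is a rough sketch that hand-encodes the query from the capture with and without an EDNS0 OPT record. It uses only the standard library rather than miekg/dns, and `buildQuery` and `appendU16` are illustrative helpers, not functions from certmagic.

```go
package main

import (
	"fmt"
	"strings"
)

// appendU16 appends v in network byte order (big-endian).
func appendU16(b []byte, v uint16) []byte {
	return append(b, byte(v>>8), byte(v))
}

// buildQuery encodes a single-question DNS query in wire format.
// If udpSize is non-zero, an EDNS0 OPT pseudo-record advertising that
// UDP payload size is added to the additional section.
func buildQuery(id uint16, fqdn string, qtype, udpSize uint16) []byte {
	var arCount uint16
	if udpSize != 0 {
		arCount = 1
	}
	msg := make([]byte, 0, 64)
	msg = appendU16(msg, id)
	msg = appendU16(msg, 0x0100) // flags: standard query, RD set (as in the capture)
	msg = appendU16(msg, 1)      // QDCOUNT
	msg = appendU16(msg, 0)      // ANCOUNT
	msg = appendU16(msg, 0)      // NSCOUNT
	msg = appendU16(msg, arCount)

	// Question name: length-prefixed labels terminated by the root label.
	for _, label := range strings.Split(strings.TrimSuffix(fqdn, "."), ".") {
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	msg = append(msg, 0)
	msg = appendU16(msg, qtype)
	msg = appendU16(msg, 1) // class IN

	if udpSize != 0 {
		msg = append(msg, 0)          // owner name: root
		msg = appendU16(msg, 41)      // TYPE: OPT
		msg = appendU16(msg, udpSize) // CLASS field carries the UDP payload size
		msg = appendU16(msg, 0)       // extended RCODE and EDNS version
		msg = appendU16(msg, 0)       // flags (DO bit clear)
		msg = appendU16(msg, 0)       // RDLENGTH: no options
	}
	return msg
}

func main() {
	const name = "_acme-challenge.kapi.kmpm.dev."
	const typeTXT = 16
	plain := buildQuery(0x50c3, name, typeTXT, 0)
	edns := buildQuery(0x7edb, name, typeTXT, 4096)
	fmt.Printf("without EDNS0: %d bytes, with EDNS0: %d bytes\n", len(plain), len(edns))
}
```

The two payloads come out at 47 and 58 bytes. Adding the 42 bytes of Ethernet, IPv4 and UDP headers gives the 89-byte and 100-byte frames in the Wireshark output, so the 11-byte OPT record is the only difference between the two queries.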


In this case it's not my firewall, at least. It could be at the registrar, which I query directly; I have contacted their support and hope to get some information back.

There might be similar issues for other users with their registrars. I have found 2 in Sweden that behave strangely when SetEdns0 is used: Loopia and Glesys.

I got a first answer back from their technician. He had some good input on the size of the EDNS buffer.

The EDNS buffer size should default to 1232 bytes to avoid possible fragmentation, unless you know that you need more. This is according to DNS Flag Day 2020 and what was presented there.
But there is some conflicting information between the RFCs and what was presented at DNS Flag Day.
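
As background on where 1232 comes from: it is the IPv6 minimum MTU of 1280 bytes minus the 40-byte IPv6 header and the 8-byte UDP header. The arithmetic as a tiny sketch (the constant and function names are illustrative only):

```go
package main

import "fmt"

const (
	ipv6MinMTU     = 1280 // smallest MTU every IPv6 link must support
	ipv6HeaderSize = 40   // fixed IPv6 header
	udpHeaderSize  = 8    // UDP header
)

// safeEDNSPayload returns the largest DNS message that still fits in a
// single unfragmented UDP datagram on a link with the IPv6 minimum MTU.
func safeEDNSPayload() int {
	return ipv6MinMTU - ipv6HeaderSize - udpHeaderSize
}

func main() {
	fmt.Println(safeEDNSPayload()) // prints 1232
}
```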


I will use propagation_delay and propagation_timeout as a workaround, but I really want to figure this one out, just because "how hard can it be".


I made a small project where I extracted the code that does the validation. With some modifications it has become a nice tool to troubleshoot this issue.
Just wanted to post a link if anyone finds it useful

I have a reminder set for this post to read the resources you linked later and understand the nuances of this particular DNS issue.

This topic was automatically closed after 30 days. New replies are no longer allowed.

Sorry for taking too long. I was distracted then forgot.

First, I appreciate that the ISP is keeping track of the RFCs and DNS Flag Day considerations! If we take the wording carefully, it does not say the message size SHOULD NOT exceed 1232 bytes. It says (to Authoritative DNS Operators):

[y]ou should also configure your servers to negotiate an EDNS buffer size that will not cause fragmentation. The value recommended here is 1232 bytes.

Note it says it's a recommendation, not mandatory. DNS resolver operators are asked to follow the same instructions given for authoritative DNS operators.

That said, the EDNS(0) RFC does pin some action on us in case we use 4096 as our starting point:

Ref: RFC 6891 - Extension Mechanisms for DNS (EDNS(0))

A good compromise may be the use of an EDNS maximum payload size of 4096 octets as a starting point.
A requestor MAY choose to implement a fallback to smaller advertised sizes to work around firewall or other network limitations. A requestor SHOULD choose to use a fallback mechanism that begins with a large size, such as 4096. If that fails, a fallback around the range of 1280-1410 bytes SHOULD be tried, as it has a reasonable chance to fit within a single Ethernet frame.

So we may choose to implement a fallback to a smaller size, and if a fallback is implemented, we should try a size in the range of 1280-1410 bytes. Wondering whether 1232 bytes is big enough for our use, I checked the RFC section outlining the specs for the DNS challenge. It says the client should compute the SHA-256 digest of the key authorization, then set the TXT record to the base64url encoding of that digest. A SHA-256 digest is 32 bytes, or 43 characters once base64url-encoded (the TXT length seen in the capture above), so the response fits comfortably within 1232 bytes (the recommended EDNS0 size).
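
A quick way to check how large the challenge value really is: the sketch below computes a DNS-01 TXT value the way RFC 8555 describes it (base64url without padding of the SHA-256 digest of the key authorization). The input string is a made-up placeholder, and `dns01Value` is an illustrative helper, not certmagic's code.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// dns01Value returns the TXT record value for a DNS-01 challenge:
// the unpadded base64url encoding of SHA-256(keyAuthorization).
func dns01Value(keyAuthorization string) string {
	sum := sha256.Sum256([]byte(keyAuthorization))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	// Placeholder input; a real key authorization has the form
	// "<token>.<account key thumbprint>".
	value := dns01Value("example-token.example-thumbprint")
	fmt.Printf("TXT value is %d characters long\n", len(value))
}
```

Whatever the input, the value is always 43 characters. With the one-byte length prefix, that gives the 44-byte TXT data length shown in the Wireshark reply.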

The fallback idea is tempting, but I'm afraid of the complexity of the code it'll entail. I'll shoot a PR to set the EDNS0 value to 1232 bytes, per the recommendation.
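
For reference, the size fallback that RFC 6891 allows could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `queryFunc` type stands in for whatever actually sends the query, the size list is arbitrary, and treating an empty answer as a reason to retry is a choice suited to the symptom in this thread, not something the RFC prescribes.

```go
package main

import (
	"errors"
	"fmt"
)

// queryFunc sends one query advertising the given EDNS0 UDP payload size
// (0 meaning no EDNS0 at all) and reports how many answer records came back.
// It is a hypothetical stand-in for the real DNS exchange.
type queryFunc func(udpSize uint16) (answers int, err error)

var errNoAnswer = errors.New("no answer at any advertised size")

// queryWithFallback tries each advertised size in order and returns the
// first size that produced at least one answer record.
func queryWithFallback(sizes []uint16, query queryFunc) (uint16, int, error) {
	var lastErr error
	for _, size := range sizes {
		n, err := query(size)
		if err == nil && n > 0 {
			return size, n, nil
		}
		if err != nil {
			lastErr = err // remember the most recent failure
		}
	}
	if lastErr == nil {
		lastErr = errNoAnswer
	}
	return 0, 0, lastErr
}

func main() {
	// Fake server that only answers when at most 1232 bytes are advertised.
	fake := func(udpSize uint16) (int, error) {
		if udpSize > 1232 {
			return 0, nil
		}
		return 1, nil
	}
	size, n, err := queryWithFallback([]uint16{4096, 1232, 0}, fake)
	fmt.Println(size, n, err)
}
```

Even in this reduced form the loop multiplies the number of queries per propagation check, which is part of the complexity a fixed 1232-byte default avoids.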
