Failed to get certificate: Error presenting token: Unexpected response code 'SERVFAIL' with cloudflare

SacredSkull · February 3, 2018, 10:53pm

I’ve had a strange error with Cloudflare DNS that I found a workaround for just as I was typing this - but it’s very ugly, and to call it “automatic” would be a crime.

Just as a note, I’m using Caddy 0.10.10 (the latest release on GitHub) on Arch Linux x64, and routing domains through Cloudflare has been disabled until this works.

Running the automatic HTTPS on my sites (including root and subdomains) initially got this:

CLOUDFLARE_EMAIL=--REDACTED-- CLOUDFLARE_API_KEY=--REDACTED-- /usr/local/bin/caddy -log stdout -agree=true -conf=/etc/Caddyfile -root=/var/tmp
Activating privacy features...
2018/02/03 22:07:52 [INFO][--REDACTED--] acme: Obtaining bundled SAN certificate
2018/02/03 22:07:53 [INFO][--REDACTED--] AuthURL: https://acme-v01.api.letsencrypt.org/acme/authz/--REDACTED--
2018/02/03 22:07:53 [INFO][--REDACTED--] acme: Could not find solver for: http-01
2018/02/03 22:07:53 [INFO][--REDACTED--] acme: Trying to solve DNS-01
[--REDACTED--] failed to get certificate: Error presenting token: Unexpected response code 'SERVFAIL' for _acme-challenge.--REDACTED--.

I’m fairly positive my configuration has nothing to do with this, and I’ve checked the Cloudflare email and API key (both the global and CA ones) - but here it is:

--REDACTED-- {
    import ../https.caddyfile
    import ../php.caddyfile
    gzip
    root /some/actual/directory
    rewrite {
        to {path} {path}/ /index.php?{query}
    }
}

The https.caddyfile contains the following:

tls my@actual.emailaddress {
    dns cloudflare
}

This wouldn’t work at all, printing the message above. I found a workaround for at least my root record by manually creating a DNS TXT record on my Cloudflare account. Even though the TXT record was filled with garbage, the Cloudflare handler then worked: it replaced the record’s data with the actual token and the challenge was solved correctly.
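
For reference, the record the handler is meant to create can be worked out by hand. Per the ACME spec (RFC 8555), the DNS-01 TXT record sits at _acme-challenge.<domain> and holds the unpadded base64url SHA-256 digest of the key authorization. A minimal Python sketch, assuming nothing about Caddy's internals (the function name and the placeholder inputs are made up for illustration):

```python
import base64
import hashlib
from typing import Tuple


def dns01_txt_record(domain: str, key_authorization: str) -> Tuple[str, str]:
    """Return the (name, value) of the TXT record a DNS-01 challenge expects."""
    # The record always lives at _acme-challenge.<domain> (fully qualified here).
    name = "_acme-challenge." + domain.rstrip(".") + "."
    # The value is the unpadded base64url encoding of SHA-256(key authorization).
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return name, value


if __name__ == "__main__":
    # Placeholder inputs: a real key authorization is "<token>.<account key thumbprint>".
    print(dns01_txt_record("example.com", "token.thumbprint"))
```

Comparing this value with what actually ends up in the Cloudflare dashboard shows whether the handler wrote the record or left the placeholder untouched.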

As far as I can tell this is an issue with cloudflare’s API - whether on their end or the resolver I don’t know. Disabling cloudflare on my website has no effect.

EDIT: I also just noticed that when I was testing this that I wasn’t running caddy as root/sudo and additionally I didn’t pass in the location of the SSL certificate store e.g. CADDYPATH=/etc/ssl/caddy but that had no effect whatsoever and I am still getting the error.

tobya · February 4, 2018, 9:32am

From a quick search, SERVFAIL seems to be a LetsEncrypt error returned when it is unable to resolve your domain correctly. You may need to do some additional checking to ensure that your DNS records are resolving correctly.

Also, you will need to ensure that your server can be seen on both ports 80 and 443.

SacredSkull · February 4, 2018, 4:10pm

Just tried using the standard LetsEncrypt/certbot verification method and it worked perfectly with several domains, including the root and a sub-domain - so I don’t think it’s an error on my end. I even disabled DNSSEC on Cloudflare and my registrar, to no effect.

I also performed a manual check (on two PCs, on completely different internet connections) on my domain with dig. For this person, the SERVFAIL error was caused by a bad nameserver - I doubt that Cloudflare’s nameservers are at fault here, but I tested them anyway (below).

My domain returns the correct information:

; <<>> DiG 9.12.0 <<>> -t type257 mydomain.com @abby.ns.cloudflare.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9859
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;mydomain.com.               IN      CAA

;; Query time: 27 msec
;; WHEN: Sun Feb 04 16:06:10 GMT 2018
;; MSG SIZE  rcvd: 106
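
When checking more than one nameserver, it is easier to pull the status out of the header line than to read each dump by eye. A rough Python sketch that shells out to dig (assumed to be on the PATH); the helper names are hypothetical:

```python
import re
import subprocess
from typing import Optional

# Matches the response code in dig's ";; ->>HEADER<<- ... status: X, ..." line.
_STATUS_RE = re.compile(r"->>HEADER<<-.*?status:\s*([A-Z]+)")


def dig_status(output: str) -> Optional[str]:
    """Extract the response code (NOERROR, SERVFAIL, ...) from dig output."""
    match = _STATUS_RE.search(output)
    return match.group(1) if match else None


def query_status(name: str, rtype: str, nameserver: str) -> Optional[str]:
    """Run dig against one nameserver and return its response code."""
    # This goes out over the network; None means no header line was found.
    result = subprocess.run(
        ["dig", "-t", rtype, name, "@" + nameserver],
        capture_output=True, text=True, check=False,
    )
    return dig_status(result.stdout)
```

Calling query_status("mydomain.com", "CAA", "abby.ns.cloudflare.com") once per nameserver gives one status per server, which makes a single misbehaving one easy to spot.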

Fairly sure it’s an issue with the cloudflare API

Whitestrake · February 5, 2018, 12:24am

Just renewed a DNS-01 cert on my pfSense box, then on a Caddy instance. If Cloudflare is being funky, it must be NS-specific - mine are jean and jeff.

whitestrake at apollo in ~/Projects/test
❯ caddy -version
Caddy 0.10.10 (non-commercial use only)

whitestrake at apollo in ~/Projects/test
❯ cat Caddyfile
test.whitestrake.net {
  tls {
    dns cloudflare
  }
  status 200 /
}

whitestrake at apollo in ~/Projects/test
❯ caddy -agree -email letsencrypt@whitestrake.net -log stdout
Activating privacy features...
2018/02/05 10:16:48 [INFO][test.whitestrake.net] acme: Obtaining bundled SAN certificate
2018/02/05 10:16:50 [INFO][test.whitestrake.net] AuthURL: https://acme-v01.api.letsencrypt.org/acme/authz/MReRYh2WwMqDUvsb6hmu_deznQpuQSK6J2gpKqM1Lk8
2018/02/05 10:16:50 [INFO][test.whitestrake.net] acme: Trying to solve DNS-01
2018/02/05 10:17:22 [INFO][test.whitestrake.net] Checking DNS record propagation using [192.168.2.1:53 54.206.53.54:53 8.8.8.8:53 8.8.4.4:53]
2018/02/05 10:18:26 [INFO][test.whitestrake.net] The server validated our request
2018/02/05 10:18:27 [INFO][test.whitestrake.net] acme: Validations succeeded; requesting certificates
2018/02/05 10:18:28 [INFO] acme: Requesting issuer cert from https://acme-v01.api.letsencrypt.org/acme/issuer-cert
2018/02/05 10:18:28 [INFO][test.whitestrake.net] Server responded with a certificate.
2018/02/05 10:18:28 [INFO][test.whitestrake.net] Certificate written to disk: /Users/whitestrake/.caddy/acme/acme-v01.api.letsencrypt.org/sites/test.whitestrake.net/test.whitestrake.net.crt

Why are you testing for CAA records, by the way?

SacredSkull · February 5, 2018, 11:17pm

I tested CAA records because in this issue SERVFAIL was returned because their nameservers were obviously buggy (they didn’t actually recognise the type of record that was being requested):

Your name servers are responding with SERVFAIL to all CAA queries

Though, in that issue it was a much smaller nameserver - the internet would have exploded with error reports if Cloudflare did the same thing, even if only two sets of nameservers were to blame.

Is it possible that the version of ACME library you’re using (I’m assuming certbot or something?) has bugged out?

To reiterate, I had no issues whatsoever when I manually verified these domains with certbot in either mode:

standalone
using lighttpd as the host web server (i.e. it spins up temporary files in its data directory)

So I can’t see how it can be on my end - though the fact you’ve got it working on cloudflare is confusing…

Any suggestions as to what I might’ve missed? I don’t have any firewalls on my server, I just use the router for this - ports 80 and 443 are forwarded correctly (otherwise, of course, certbot would have failed).

I’ll set up a DMZ on my server’s IP address, as a last resort. Failing that I honestly have no clue what else it could be if not on my end

Whitestrake · February 5, 2018, 11:24pm

Caddy uses lego (go-acme/lego on GitHub), a Let’s Encrypt client and ACME library written in Go, as its ACME library.

I know it doesn’t solve the problem you’re having with DNS validation, but if you’ve got ports 80 and 443, why not use regular HTTP validation?

Port forwarding/DMZ will have no effect on the DNS challenge.

Emil_Lynge · February 6, 2018, 9:52pm

I’m having the same problem. Even in staging.
Using certbot I can successfully obtain a cert in staging:
sudo certbot --dns-cloudflare certonly --staging --email emillynge24@gmail.com -d hildemil.net --dns-cloudflare-credentials /root/cloudflare.ini

Which suggests that there is some issue with the caddy plugin rather than the DNS setup or the cloudflare API.

I know it doesn’t solve the problem you’re having with DNS validation, but if you’ve got ports 80 and 443, why not use regular HTTP validation?

For me, the problem here is a matter of being able to hotswap an instance. If I have to setup an entirely new proxy, I cannot obtain certificates until the proxy is live, which will cause downtime until new certs are obtained.

Also, I like to be able to test my setup locally using docker containers, in which case it is nice to be able to get certificates even when my proxy is not publicly reachable.

Whitestrake · February 7, 2018, 12:16am

Yeah, that’s a pretty good reason for DNS validation, I was mostly just curious about @SacredSkull’s situation since it looks like he’s testing with HTTP validation.

There must be some environmental factor or circumstance that you both share. I can get a DNS-validated cert with the pfSense ACME package, Caddy, and lego itself.

Might be time to open an issue on the Caddy repo.

SacredSkull · February 7, 2018, 1:11am

Oops, you’re right. I was testing with HTTP validation - just tried Certbot with the manual DNS approach: certbot -d mydomain.net --manual --preferred-challenges dns certonly. It failed to begin with; the second time around I waited for the changes to propagate and it worked.

I see that Lego supports “standalone” (self-hosting) and “webroot” (you specify the domain root filesystem path). Forgive me if I have missed this somewhere, but can you configure Caddy to use (automatic) HTTP validation?

I don’t mind using either approach but if two people are having issues with the DNS validator there might be an actual issue there.

SacredSkull · February 7, 2018, 1:13am

Out of interest, what name servers do you have on Cloudflare and what distro are you using?

For me:

abby.ns.cloudflare.com
greg.ns.cloudflare.com

…And I’m running Arch Linux x64.

Whitestrake · February 7, 2018, 1:17am

Hmm… Did it fail with a SERVFAIL the first time, or some other error?

Yep! Caddy’s default behaviour is HTTP-01 or TLS-SNI-01 validation (although I think the latter is still disabled by LetsEncrypt? Caddy will use whichever works anyway). Just remove the dns cloudflare from your tls directive and it will go back to those.

It definitely seems to be an issue - whether it’s some specific DNS zone item, or Cloudflare response, causing an issue for lego or possibly even something Caddy is doing, I’m unsure.

SacredSkull · February 7, 2018, 2:27am

No SERVFAIL, just the client lacks sufficient authorization :: No TXT record found at _acme-challenge.mydomain.com - the changes just weren’t visible yet, AFAIK.
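
That kind of failure can in principle be avoided by polling for the TXT record before letting the CA validate. A rough Python sketch of such a wait loop; `lookup` is a hypothetical callable you would supply yourself (for example a wrapper around dig), and the timeout and interval defaults are arbitrary:

```python
import time
from typing import Callable, Iterable, List


def wait_for_txt(
    lookup: Callable[[str, str], List[str]],
    resolvers: Iterable[str],
    name: str,
    expected: str,
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every resolver returns `expected` for the TXT record `name`."""
    resolvers = list(resolvers)
    deadline = time.monotonic() + timeout
    while True:
        # A resolver counts as ready once the expected value is among its answers.
        if all(expected in lookup(resolver, name) for resolver in resolvers):
            return True
        # Give up once the deadline has passed; otherwise wait and retry.
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Only once this returns True would you press Enter in certbot's manual prompt; a False result means the record never became visible within the timeout.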

I wasn’t even aware that you could use an empty set of brackets for TLS… I just saw Cloudflare and went with it apparently

Whitestrake · February 7, 2018, 3:04am

If there’s nothing in the brackets, you can remove them entirely. Just tls email@example.com on its own will work. Or even no tls directive at all - no need to put it in if you don’t need to configure anything (such as if you use the -email flag when running Caddy to specify the email for all your sites).

SacredSkull · February 7, 2018, 4:11am

Instead of creating a new thread just to ask this: Is there any way to provide a fallback fastcgi address? I’m using PHP and whilst I usually prefer HHVM, it isn’t the most reliable thing ever.

In Nginx I could set up several fastcgi addresses to try in order (i.e. HHVM first, if that fails, PHP-FPM) - is this possible in Caddy?

Whitestrake · February 7, 2018, 4:28am

You can specify multiple addresses with the upstream subdirective, but per documentation, it performs only basic load balancing. I don’t believe that includes blacklisting / health checking unreliable upstream addresses.

https://caddyserver.com/docs/fastcgi

matt · February 7, 2018, 4:33am

Someday, if I ever finish the proxy rewrite, it will have all the same features as the proxy directive. Just not yet.

Emil_Lynge · February 7, 2018, 9:56am

I think I found my problem last night (but then the battery died).

For me, the problem seems to be local DNS caching that was returning outdated SOA records (I think those have a very long TTL). I just moved my DNS from Linode to Cloudflare, and the SOA lookup was returning Linode nameservers since those were cached locally. This causes Caddy to direct subsequent DNS requests to the Linode nameservers, which then fail since I turned off the Linode DNS manager.
The error message I got, however, was (to me at least) quite confusing, since it looked like the error came from Let’s Encrypt, when in actuality it came from a lookup that Caddy made against a nameserver.
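
One way to test this theory would be to compare the SOA primary nameserver seen through the local resolver with what a public resolver returns. A rough Python sketch, assuming dig is installed; the helper names are made up and 8.8.8.8 is just one possible reference resolver:

```python
import subprocess
from typing import Optional


def soa_primary(output: str) -> Optional[str]:
    """Return the primary nameserver (MNAME) from `dig +short SOA` output."""
    for line in output.splitlines():
        fields = line.split()
        if fields:
            # The first field of an SOA answer is the primary nameserver.
            return fields[0].rstrip(".").lower()
    return None


def query_soa_primary(zone: str, resolver: Optional[str] = None) -> Optional[str]:
    """Ask one resolver (or the system default) for the zone's SOA primary."""
    cmd = ["dig", "+short", "SOA", zone]
    if resolver:
        cmd.append("@" + resolver)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return soa_primary(result.stdout)


def looks_stale(local: Optional[str], reference: Optional[str]) -> bool:
    """True when both lookups answered and they disagree on the primary."""
    return local is not None and reference is not None and local != reference
```

For example, looks_stale(query_soa_primary("example.com"), query_soa_primary("example.com", "8.8.8.8")) returning True would hint at a stale local cache, though the reference resolver can of course be holding an old answer too.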

I have not yet confirmed this, but I expect the issue to be resolved when I try again after a fresh boot.
