Global dns challenge and dns_challenge_override_domain

1. Output of caddy version:

v2.5.2

2. How I run Caddy:

brew localhost

a. System environment:

mac

b. Command:

caddy run

d. My complete Caddy config:

{
    acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}

3. The problem I’m having:

I’d like to push new domains via the API. These will come from client sites I don’t control, so the clients will create CNAMEs pointing to my DNS. Some of them will be wildcards such as *.myexamplesite.com, to get a single wildcard certificate (the client does not want separate certs for every subdomain).
I’ve noticed that to do this, I need to push a new section to the routes at /config/apps/http/servers/srv0/routes/0, and also to the policies at /config/apps/tls/automation/policies/0 in order to set dns_challenge_override_domain (as the challenge is delegated from another site).

My question is - is it possible to have a global setting for this? Something like

{
	tls {
		dns route53 {
			aws_profile "my_profile"
			max_retries 1
		}
		dns_challenge_override_domain _acme-challenge.myexamplesite.com
	}
}

The current policies look like this if I dump the config:

{
            "subjects": ["*.manage.clientsite.com", "manage.clientsite.com"],
            "issuers": [
              {
                "ca": "https://acme-staging-v02.api.letsencrypt.org/directory",
                "challenges": {
                  "dns": {
                    "override_domain": "_acme-challenge.myexamplesite.com",
                    "provider": {
                      "aws_profile": "my_profile",
                      "max_retries": 1,
                      "name": "route53"
                    }
                  }
                },
                "module": "acme"
              },
...

5. What I already tried:

Currently I’m just pushing separately to the routes and the policies. I can’t see anything in the docs for a global dns_challenge_override_domain.

In the JSON config, a TLS automation policy applies to all hostnames if you leave out the subjects key, so you only need to specify it once.
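As a rough illustration of that idea, here is a minimal Python sketch that builds a single policy without a subjects key, reusing the issuer settings from the config dump above. The function name is made up for this example, and the override domain is passed in as a parameter because the right value (with or without the _acme-challenge prefix) is discussed further down the thread.

```python
import json


def catch_all_policy(override_domain: str) -> dict:
    """Build one TLS automation policy that omits "subjects".

    Hypothetical helper for illustration; the issuer values are copied
    from the config dump earlier in this topic.
    """
    return {
        # No "subjects" key here, so the policy is not tied to specific hostnames.
        "issuers": [
            {
                "module": "acme",
                "ca": "https://acme-staging-v02.api.letsencrypt.org/directory",
                "challenges": {
                    "dns": {
                        "override_domain": override_domain,
                        "provider": {
                            "name": "route53",
                            "aws_profile": "my_profile",
                            "max_retries": 1,
                        },
                    }
                },
            }
        ]
    }


policy = catch_all_policy("_acme-challenge.myexamplesite.com")
tls_app = {"apps": {"tls": {"automation": {"policies": [policy]}}}}
print(json.dumps(tls_app, indent=2))
```

With a policy like this in place, onboarding a new client should only require adding a route, not a second per-domain policy.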

Don’t include _acme-challenge here, it should just be the actual domain, and that subdomain will be added during the lookup.

Great thank you.

Can I ask about the scalability of this method? The API is essentially adding to or rebuilding the JSON file, so how well is this processed at high volumes? I.e. if I’m onboarding thousands of clients and adding a new route block for each one, is that eventually going to create a performance problem?

I’m not really sure what to tell you with the information provided. Thousands of lines of JSON take more CPU cycles than a few lines of JSON. But computers are fast, so, I dunno.

We’d have to see your full config to get a better sense of performance implications beyond that.

If you’re using JSON config, my recommendation would be to avoid repetition where possible. With the JSON one is often able to craft pretty elegant configs.

As someone who runs many Caddy clusters, some with 30k+ domains, the performance has not been an issue even on shared CPU vms. Any time I run into performance issues with updating the config, it’s because I’m doing something unusual and so far it has always been on my end, not Caddy’s.

I did find it easier however to not send 1 config update at a time (as they occur), but instead to generate the entire json config file separately, and every minute update Caddy with it only if the config has changed. That helped reduce intermittent issues where for whatever reason it didn’t actually add/update/delete a domain with the admin API (caused by a networking blip usually).
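A minimal sketch of that "regenerate everything, push only on change" approach might look like the following. It assumes the admin API is on its default address (localhost:2019) and uses the POST /load endpoint; the function names and the idea of keeping the last digest around are illustrative, not something described in the post above.

```python
import hashlib
import json
import urllib.request
from typing import Optional

# Assumption: Caddy's admin API listens on its default address.
ADMIN_LOAD_URL = "http://localhost:2019/load"


def config_digest(config: dict) -> str:
    """Hash a canonical serialization so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_push(config: dict, last_digest: Optional[str]) -> bool:
    """True when the generated config differs from the last one pushed."""
    return config_digest(config) != last_digest


def push_config(config: dict, url: str = ADMIN_LOAD_URL) -> int:
    """Replace the running config in one request; returns the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

A scheduled job would regenerate the config, call needs_push with the digest it stored after the previous run, and call push_config only when that returns True.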

@matt my configs have a lot of repetition with reverse proxy routes where the only real difference is the upstream URL and the port. The repetitive lines are where I add headers to the upstream and the response. Is there a way that I could reduce the repetition with a default setting or is there some concept of a “write once, refer many times” config variable?

@francislavoie
I’m finding this doesn’t work if I take off the _acme-challenge prefix.
I have a CNAME record pointing from _acme-challenge.manage.clientsite.com to _acme-challenge.myexamplesite.com.
If I change the config as you suggested, I get the following error:

2022/08/08 09:08:34.573 ERROR   tls.issuance.acme.acme_client   cleaning up solver      {"identifier": "*.manage.clientsite.com", "challenge_type": "dns-01", "error": "no memory of presenting a DNS record for manage.clientsite.com (probably OK if presenting failed)"}
2022/08/08 09:08:34.730 ERROR   tls.obtain      could not get certificate from issuer   {"identifier": "*.manage.clientsite.com", "issuer": "acme-staging-v02.api.letsencrypt.org-directory", "error": "[*.manage.clientsite.com] solving challenges: presenting for challenge: adding temporary record for zone myexamplesite.com.: InvalidChangeBatch: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: 123, InvalidChangeBatch: [Tried to create resource record set [name='myexamplesite.com.', type='TXT'] but it already exists] (order=https://acme-staging-v02.api.letsencrypt.org/acme/order/123/123) (ca=https://acme-staging-v02.api.letsencrypt.org/directory)"}

Looking at my DNS records in Route 53, the ACME challenge record does not exist before or after this runs.

However, with my current config, with _acme-challenge included in the override domain, it does successfully get the certificates, but it fails to clean up the DNS record:

2022/08/08 09:39:01.893 ERROR   tls.issuance.acme.acme_client   cleaning up solver      {"identifier": "*.manage.clientsite.com", "challenge_type": "dns-01"}

Any suggestions? Is the target CNAME correct?

UPDATE: I’ve raised an issue against the route53 plugin for this: Unable to pass delegated DNS challenge when using caddy dns_challenge_override_domain · Issue #24 · caddy-dns/route53 · GitHub

There’s no concept of “references” in JSON config. You could use a host matcher which matches multiple domains at once, and have all your header/upstream stuff inside of that, if it’s common. But other than that, not really. It’s easier for Caddy to provision from a config if the config is flat.
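To illustrate the host matcher suggestion, here is a small Python sketch that generates one route for a whole group of hostnames, so the header settings are written once per group instead of once per domain. It only helps for hosts that share an upstream. The helper name, hostnames, upstream address and header names are all placeholders invented for this example.

```python
import json


def shared_proxy_route(hosts, upstream):
    """One route whose host matcher covers every hostname in `hosts`.

    The header names below are placeholders, not taken from a real config.
    """
    return {
        "match": [{"host": list(hosts)}],
        "handle": [
            {
                "handler": "reverse_proxy",
                "upstreams": [{"dial": upstream}],
                "headers": {
                    "request": {"set": {"X-Example-Request": ["shared-value"]}},
                    "response": {"set": {"X-Example-Response": ["shared-value"]}},
                },
            }
        ],
    }


route = shared_proxy_route(
    ["a.clientsite.com", "b.clientsite.com"], "localhost:8080"
)
print(json.dumps(route, indent=2))
```

Hosts that need a different upstream would still get their own route, so the saving depends on how many domains share a backend.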

That’s strange. I would expect certmagic to prepend _acme-challenge itself before telling the DNS plugin to update the TXT record. Hmm… lemme read the code…

Well then. Apparently it overrides the call to DNS01TXTRecordName from acmez which returns "_acme-challenge." + c.Identifier.Value. So I guess it is required to include that prefix.

That’s pretty annoying. It should probably be adjusted to add that prefix if it doesn’t exist in the configured value. WDYT @matt ?
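For illustration only, the adjustment being proposed could be sketched in Python like this. It is not the certmagic or acmez code; the function names are invented here, and the first function just mirrors the "_acme-challenge." + identifier behaviour described above.

```python
ACME_PREFIX = "_acme-challenge."


def default_txt_record_name(identifier: str) -> str:
    """Default DNS-01 record name: the prefix plus the identifier value."""
    return ACME_PREFIX + identifier


def normalize_override_domain(value: str) -> str:
    """Hypothetical fix: add the prefix only when the configured value lacks it."""
    if value.startswith(ACME_PREFIX):
        return value
    return ACME_PREFIX + value


print(default_txt_record_name("manage.clientsite.com"))
print(normalize_override_domain("myexamplesite.com"))
print(normalize_override_domain("_acme-challenge.myexamplesite.com"))
```

Under this sketch both spellings of the override domain would resolve to the same TXT record name, which is the behaviour being suggested here.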

Ah, the cleanup failure is because of a bug which has already been fixed upstream, but the fix hasn’t been included in a Caddy release yet.

Thanks for chiming in @Carter_Bryden – nice to see you again!

Interesting – config reloads are a no-op if they haven’t changed (are byte-for-byte the same). Was Caddy still reloading an unchanged config? I’d like to know more about that. :thinking:

Most repetition can be reduced through the use of map or vars / placeholders. Can you post your config? (Maybe in a new topic.)

@davebain Ah yeah, Francis is right about the bug that was recently fixed for the cleanup phase.

Hmm, yes that is true. Is there any reason someone would need to customize the entire challenge domain? If so I can see this being beneficial. But I think the _acme-challenge subdomain is hard-coded into the DNS challenge spec. So maybe we should always prepend it. I pinged the author of the PR for that feature to check with them; otherwise I’m OK with prepending it ourselves.

@matt If it is, I guess it’s only relevant for the first CNAME, i.e. the client site might be _acme-challenge.clientsite.com, but if the challenge is delegated, that could be pointing to another record with any name, and that’s the one that needs to be cleaned up.

Thanks @francislavoie
Do you have any idea when this would be available in a caddy release? Just want to get an idea if we’re talking days, weeks or months. :grinning:

Probably weeks, at this point.

Unless Matt decides we’re about ready to cut a release.

But I think there’s still some open issues we want to resolve first before a release, so it’ll take a bit of time. But not too much time.

I think you can build from the master branch right now though; the change should be there already, as of commit caddyserver/caddy@63c7720 ("go.mod: Upgrade CertMagic and acmez").

Interesting – config reloads are a no-op if they haven’t changed (are byte-for-byte the same). Was Caddy still reloading an unchanged config? I’d like to know more about that. :thinking:

In my use case, I might have 20+ instances all over the globe and 30k+ domains on that cluster. When a user wanted to add a new domain/subdomain, previously I’d hit the admin endpoint of every instance in the cluster. But it would struggle if those updates were happening too fast, and it was easier for instances to get out of sync. Having each instance use something like a cron job to pull in from a central location every minute if the config has changed was more reliable. Especially if domains were being added rapid fire, something like 1000 in a minute (from my own user), which does happen in some cases.

Most repetition can be reduced through the use of map or vars / placeholders. Can you post your config? (Maybe in a new topic.)

That’s something I’ve actually been meaning to ask about. I can’t really ever post a proper non-redacted config here because it would expose a ton of real customer data, which would get me into trouble ethically and potentially legally. That’s why I’m not in here too often. If there was some private way to do that (maybe for a certain level of sponsor?), I might be able to offer a less redacted config, but I can’t post it publicly. Also, the configs can be like 20 MB of JSON (30-50k domains) so that would be tricky too. Not that it’s your fault or that you owe me support! I’m just responsible for that data.

Would all API requests struggle, or just the ones that made changes? (i.e. are you POSTing unchanged configs and those requests also struggled? or just requests with different configs)

And what do you mean by “struggle” exactly? Too much CPU, high latency, etc? What symptoms were you experiencing?

Absolutely; I can provide help in private to sponsors (Indie Pro or higher). Generally I recommend sponsorship tiers correspond with your company’s size/scale so you can have the resources you need to support your business. The tier names should be a good indicator of that, but you can sign up for any tier that has the perks you need/want. We can also customize sponsorship plans, just let me know if you have questions about that.

Would all API requests struggle, or just the ones that made changes? (i.e. are you POSTing unchanged configs and those requests also struggled? or just requests with different configs)
And what do you mean by “struggle” exactly? Too much CPU, high latency, etc? What symptoms were you experiencing?

CPU and memory were going way up and sometimes crashing the VM it was running in, and I would see logs that looked like it was causing either timeouts or killing a request/process when it reloaded. The behavior wasn’t always totally predictable and I think had to do with how much traffic might be proxying through at that time too. Sometimes it was totally fine, and sometimes the VM was crashing and restarting 20 times an hour (a user sequentially importing a bunch of domains).

Granted that was a few Caddy versions ago so I’m betting a lot of that has been sorted out. Even if that doesn’t happen though, I just found it simpler to have many instances check in to a central source periodically, than have a central app push out to twenty instances for every update (basically multiplies queue jobs a few times).

That’s good you figured it out then; I imagine you’re using the config_load feature that can automatically pull in configs on an interval.

I’m still curious what exactly was causing the high memory usage. A profile would help here, next time it happens (the admin endpoint serves them at :2019/debug/pprof).
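For anyone wanting to capture such a profile, here is a small Python sketch. It assumes the admin endpoint is reachable on its default address and uses the standard Go pprof paths (heap, profile, goroutine); the helper names and output file names are made up for this example.

```python
import urllib.request
from typing import Optional
from urllib.parse import urlencode

# Assumption: the admin endpoint is on its default address.
DEFAULT_ADMIN = "http://localhost:2019"


def pprof_url(kind: str, admin: str = DEFAULT_ADMIN,
              seconds: Optional[int] = None) -> str:
    """Build the URL for a pprof profile such as "heap" or "profile"."""
    url = f"{admin}/debug/pprof/{kind}"
    if seconds is not None:
        # The CPU profile accepts a sampling duration in seconds.
        url += "?" + urlencode({"seconds": seconds})
    return url


def save_profile(kind: str, path: str, **kwargs) -> None:
    """Download a profile to `path` for later inspection with `go tool pprof`."""
    urllib.request.urlretrieve(pprof_url(kind, **kwargs), path)


# Example calls, left commented out because they need a running instance:
# save_profile("heap", "heap.pprof")
# save_profile("profile", "cpu.pprof", seconds=30)
```

The saved files can then be opened with `go tool pprof heap.pprof` or shared privately.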

If you’re hard-coding the domains into host matchers, it wouldn’t surprise me if it’s the loading and decoding of the certificates, if you’re not using on-demand TLS. (I’d still recommend that if you’re doing so many domains dynamically.)
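As a rough sketch of what switching to on-demand TLS could look like in generated JSON, the Python below builds a tls app with on-demand enabled on a catch-all policy. It assumes the on_demand.ask setting as it exists in Caddy 2.5.x; the ask URL is a placeholder for an endpoint you would run yourself to approve hostnames, and the function name is invented for this example.

```python
import json


def on_demand_tls_app(ask_url: str) -> dict:
    """Build a tls app that obtains certificates during the first handshake.

    `ask_url` is a placeholder for your own endpoint that decides whether
    a hostname is allowed to get a certificate.
    """
    return {
        "automation": {
            "on_demand": {"ask": ask_url},
            # No "subjects": the policy applies to any hostname the ask
            # endpoint approves, so domains need not be listed up front.
            "policies": [{"on_demand": True}],
        }
    }


tls_app_on_demand = on_demand_tls_app("http://localhost:9000/check-domain")
print(json.dumps({"apps": {"tls": tls_app_on_demand}}, indent=2))
```

Note that on-demand issuance works per hostname, so it may not fit the wildcard certificates asked about at the top of this topic.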

This is something I’d like to optimize for your use case since I want your experience to be the best it can be.