I’d like to push new domains via the API (these will be from client sites I don’t have control over, so they will create CNAMEs pointing to my DNS). Some of these will be wildcards like *.myexamplesite.com, to create a wildcard certificate, because the client does not want separate certs for every subdomain.
I’ve noticed that in order to do this, I need to push a new section to the routes (/config/apps/http/servers/srv0/routes/0) and also to the policies (/config/apps/tls/automation/policies/0) in order to set dns_challenge_override_domain, since the DNS challenge is delegated to another domain.
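To illustrate, here’s roughly what I push for each client (a simplified sketch; the hostnames, the upstream address, and the route53 provider stand in for my actual values). First the route:

```json
{
  "match": [{ "host": ["manage.clientsite.com", "*.manage.clientsite.com"] }],
  "handle": [
    { "handler": "reverse_proxy", "upstreams": [{ "dial": "10.0.0.5:8080" }] }
  ]
}
```

Then the TLS automation policy (if I’ve read the JSON structure correctly, the Caddyfile’s dns_challenge_override_domain corresponds to challenges.dns.override_domain here):

```json
{
  "subjects": ["*.manage.clientsite.com"],
  "issuers": [
    {
      "module": "acme",
      "challenges": {
        "dns": {
          "provider": { "name": "route53" },
          "override_domain": "_acme-challenge.myexamplesite.com"
        }
      }
    }
  ]
}
```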
My question is: is it possible to have a global setting for this, rather than pushing a policy for each new domain?
Can I ask about the scalability of this method? The API is essentially adding to or rebuilding the JSON config, so how well is that processed at high volume? I.e. if I’m onboarding thousands of clients and adding a new route block for each one, is that eventually going to create a performance problem?
I’m not really sure what to tell you with the information provided. Thousands of lines of JSON take more CPU cycles than a few lines of JSON. But computers are fast, so, I dunno.
We’d have to see your full config to get a better sense of performance implications beyond that.
If you’re using JSON config, my recommendation would be to avoid repetition where possible. With JSON you can often craft pretty elegant configs.
> Can I ask about the scalability of this method? The API is essentially adding to or rebuilding the JSON config, so how well is that processed at high volume? I.e. if I’m onboarding thousands of clients and adding a new route block for each one, is that eventually going to create a performance problem?
As someone who runs many Caddy clusters, some with 30k+ domains, performance has not been an issue even on shared-CPU VMs. Any time I run into performance issues with updating the config, it’s because I’m doing something unusual, and so far it has always been on my end, not Caddy’s.
I did find it easier, however, not to send one config update at a time (as they occur), but instead to generate the entire JSON config file separately and, every minute, update Caddy with it only if the config has changed. That helped reduce intermittent issues where, for whatever reason, a domain wasn’t actually added/updated/deleted via the admin API (usually caused by a networking blip).
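A rough sketch of that loop in Python (POST /load is Caddy’s replace-the-whole-config admin endpoint; the hashing helper and the admin address are just how I’d structure it, not anything Caddy-specific):

```python
import hashlib
import urllib.request

ADMIN = "http://localhost:2019"  # Caddy admin endpoint (default address)

def config_changed(config_bytes, last_hash):
    """Hash the freshly generated config so we only push when it differs."""
    new_hash = hashlib.sha256(config_bytes).hexdigest()
    return new_hash != last_hash, new_hash

def push_config(config_bytes):
    # POST /load atomically replaces the entire running config.
    req = urllib.request.Request(
        ADMIN + "/load",
        data=config_bytes,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Run it from a once-a-minute cron or systemd timer, persisting last_hash between runs.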
> If you’re using JSON config, my recommendation would be to avoid repetition where possible. With JSON you can often craft pretty elegant configs.
@matt my configs have a lot of repetition in the reverse proxy routes, where the only real difference is the upstream URL and port. The repetitive lines are where I add headers to the upstream and the response. Is there a way to reduce the repetition with a default setting, or is there some concept of a “write once, refer many times” config variable?
@francislavoie
I’m finding this doesn’t work if I take off the _acme-challenge. prefix.
I have a CNAME record pointing from _acme-challenge.manage.clientsite.com to _acme-challenge.myexamplesite.com.
If I change the config as you suggested, I get the following error:
2022/08/08 09:08:34.573 ERROR tls.issuance.acme.acme_client cleaning up solver {"identifier": "*.manage.clientsite.com", "challenge_type": "dns-01", "error": "no memory of presenting a DNS record for manage.clientsite.com (probably OK if presenting failed)"}
2022/08/08 09:08:34.730 ERROR tls.obtain could not get certificate from issuer {"identifier": "*.manage.clientsite.com", "issuer": "acme-staging-v02.api.letsencrypt.org-directory", "error": "[*.manage.clientsite.com] solving challenges: presenting for challenge: adding temporary record for zone myexamplesite.com.: InvalidChangeBatch: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: 123, InvalidChangeBatch: [Tried to create resource record set [name='myexamplesite.com.', type='TXT'] but it already exists] (order=https://acme-staging-v02.api.letsencrypt.org/acme/order/123/123) (ca=https://acme-staging-v02.api.letsencrypt.org/directory)"}
Looking at my DNS records in Route 53, the ACME TXT record does not exist before or after this runs.
However, with my current config, where _acme-challenge. is included in the override domain, it does successfully get the certificates, but it fails to clean up the DNS record afterwards.
There’s no concept of “references” in JSON config. You could use a host matcher which matches multiple domains at once, and put all your common header/upstream stuff inside of that. But other than that, not really. It’s easier for Caddy to provision from a config if the config is flat.
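For example, a single route whose host matcher lists every domain that shares the same handling (hostnames and upstream here are placeholders):

```json
{
  "match": [{ "host": ["a.example.com", "b.example.com", "c.example.com"] }],
  "handle": [
    { "handler": "reverse_proxy", "upstreams": [{ "dial": "10.0.0.5:8080" }] }
  ]
}
```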
That’s strange. I would expect certmagic to append that itself before telling the DNS plugin to update the TXT record. Hmm… lemme read the code…
Well then. Apparently it overrides the call to DNS01TXTRecordName from acmez which returns "_acme-challenge." + c.Identifier.Value. So I guess it is required to include that prefix.
That’s pretty annoying. It should probably be adjusted to add that prefix if it doesn’t exist in the configured value. WDYT @matt ?
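The adjustment would just be a guarded prepend, something like this (a Python sketch of the logic only; the actual code is Go, and the helper name is made up):

```python
ACME_PREFIX = "_acme-challenge."

def challenge_record_name(override_domain):
    """Prepend the DNS-01 label only if the configured value lacks it."""
    if override_domain.startswith(ACME_PREFIX):
        return override_domain
    return ACME_PREFIX + override_domain
```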
Ah, that’s because of this bug which has already been fixed upstream, but hasn’t been included in a Caddy release yet.
Thanks for chiming in @Carter_Bryden – nice to see you again!
Interesting – config reloads are a no-op if they haven’t changed (are byte-for-byte the same). Was Caddy still reloading an unchanged config? I’d like to know more about that.
Most repetition can be reduced through the use of map or vars / placeholders. Can you post your config? (Maybe in a new topic.)
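For example, a map handler can derive the upstream from the request host, so a single reverse_proxy route can serve many domains (a rough sketch; the destination variable name, hosts, and addresses here are made up):

```json
{
  "handle": [
    {
      "handler": "map",
      "source": "{http.request.host}",
      "destinations": ["{upstream}"],
      "mappings": [
        { "input": "a.example.com", "outputs": ["10.0.0.11:8080"] },
        { "input": "b.example.com", "outputs": ["10.0.0.12:8080"] }
      ]
    },
    {
      "handler": "reverse_proxy",
      "upstreams": [{ "dial": "{upstream}" }]
    }
  ]
}
```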
@davebain Ah yeah, Francis is right about the bug that was recently fixed for the cleanup phase.
Hmm, yes that is true. Is there any reason someone would need to customize the entire challenge domain? If so, I can see this being beneficial. But I think the _acme-challenge subdomain is hard-coded into the DNS challenge spec, so maybe we should always prepend it. I pinged the author of the PR for that feature to check with them; otherwise I’m OK with prepending it ourselves.
@matt If it is, I guess it’s only relevant for the first CNAME: i.e. the client site might be _acme-challenge.clientsite.com, but if the challenge is delegated, that could point to another record with any name, and that’s the one that needs to be cleaned up.
Thanks @francislavoie
Do you have any idea when this will be available in a Caddy release? Just want to get an idea of whether we’re talking days, weeks, or months.
> Interesting – config reloads are a no-op if they haven’t changed (are byte-for-byte the same). Was Caddy still reloading an unchanged config? I’d like to know more about that.
In my use case, I might have 20+ instances all over the globe and 30k+ domains on that cluster. When a user wanted to add a new domain/subdomain, previously I’d hit the admin endpoint of every instance in the cluster. But it would struggle if those updates came too fast, and it was easier for instances to get out of sync. Having each instance use something like a cron job to pull from a central location every minute, if the config has changed, was more reliable, especially when domains were being added rapid-fire, something like 1000 in a minute (from one of my own users), which does happen in some cases.
> Most repetition can be reduced through the use of map or vars / placeholders. Can you post your config? (Maybe in a new topic.)
That’s something I’ve actually been meaning to ask about. I can’t really ever post a proper non-redacted config here, because it would expose a ton of real customer data, which would get me into trouble ethically and potentially legally. That’s why I’m not in here too often. If there were some private way to do that (maybe for a certain level of sponsor?), I might be able to offer a less redacted config, but I can’t post it publicly. Also, the configs can be like 20 MB of JSON (30-50k domains), so that would be tricky too. Not that it’s your fault or that you owe me support! I’m just responsible for that data.
Would all API requests struggle, or just the ones that made changes? (i.e. are you POSTing unchanged configs and those requests also struggled? or just requests with different configs)
And what do you mean by “struggle” exactly? Too much CPU, high latency, etc? What symptoms were you experiencing?
Absolutely; I can provide help in private to sponsors (Indie Pro or higher). Generally I recommend sponsorship tiers correspond with your company’s size/scale so you can have the resources you need to support your business. The tier names should be a good indicator of that, but you can sign up for any tier that has the perks you need/want. We can also customize sponsorship plans, just let me know if you have questions about that.
> Would all API requests struggle, or just the ones that made changes? (i.e. are you POSTing unchanged configs and those requests also struggled? or just requests with different configs)
> And what do you mean by “struggle” exactly? Too much CPU, high latency, etc? What symptoms were you experiencing?
CPU and memory usage would climb way up, sometimes crashing the VM it was running in, and I would see logs suggesting it was either timing out or killing a request/process when it reloaded. The behavior wasn’t always predictable, and I think it also depended on how much traffic was proxying through at the time. Sometimes it was totally fine, and sometimes the VM was crashing and restarting 20 times an hour (a user sequentially importing a bunch of domains).
Granted, that was a few Caddy versions ago, so I’m betting a lot of that has been sorted out. Even so, I just found it simpler to have many instances check in with a central source periodically than to have a central app push out to twenty instances for every update (which basically multiplies queue jobs several times over).
That’s good you figured it out then; I imagine you’re using the config_load feature that can automatically pull in configs on an interval.
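It looks roughly like this in JSON (sketching from memory; the URL is a placeholder):

```json
{
  "admin": {
    "config": {
      "load": {
        "module": "http",
        "url": "https://config.example.com/caddy.json"
      },
      "load_delay": "1m"
    }
  }
}
```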
I’m still curious what exactly was causing the high memory usage. A profile would help here next time it happens (:2019/debug/pprof/).
If you’re hard-coding the domains into host matchers, it wouldn’t surprise me if it’s the loading and decoding of all the certificates, if you’re not using on-demand TLS. (I’d still recommend on-demand TLS if you’re managing so many domains dynamically.)
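A minimal on-demand TLS sketch, for reference (the ask endpoint is a placeholder service you’d run to approve which domains get certificates):

```json
{
  "apps": {
    "tls": {
      "automation": {
        "on_demand": { "ask": "http://localhost:9123/allowed" },
        "policies": [{ "on_demand": true }]
      }
    }
  }
}
```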
This is something I’d like to optimize for your use case since I want your experience to be the best it can be.