I am trying to achieve service discovery and failover for my app using Caddy and Consul DNS. I have two Consul DCs and two instances of my app (one per DC). Each service instance has a Consul health check, so if the health check fails, Consul excludes the failed instance from the DNS response (this is by design in Consul). E.g. if explorer-dashboard in dc1 fails, a DNS query for explorer-dashboard.service.dc1.consul returns nothing. I expected Caddy to handle this.
I mean I expected its behaviour to be like: "resolving explorer-dashboard.dc1 - OK; resolving explorer-dashboard.dc2 - FAILED. So the only good upstream for proxying is explorer-dashboard.dc1. I will route all traffic to it, and once explorer-dashboard.dc2 comes back up I will load-balance again." But Caddy doesn't work as I expected. It returns HTTP 502 on every second request because one instance of my app is down, e.g.:
the first request - curl https://my-domain.com returns 200
the second request - curl https://my-domain.com returns 502
Caddy tries to resolve the DNS SRV record of the failed instance and receives an error from the DNS resolver.
How can I exclude instances which cannot be resolved from traffic routing?
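For context, here is a minimal Caddyfile sketch of the setup described above. The `srv+http://` scheme tells Caddy to resolve the upstream via SRV records; the domain, service names, and load-balancing policy are illustrative, not my exact config:

```
my-domain.com {
	reverse_proxy srv+http://explorer-dashboard.service.dc1.consul srv+http://explorer-dashboard.service.dc2.consul {
		lb_policy round_robin
	}
}
```

With round-robin balancing, every second request lands on the upstream whose SRV record no longer resolves, which matches the alternating 200/502 pattern above.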
4. Error messages and/or full log output:
ERROR http.log.error making dial info: lookup explorer-dashboard.service.dc1.consul on 127.0.0.53:53: no such host
5. What I already tried:
I tried to tune the reverse_proxy health checks, but it seems the problem is not in the health checks (or I am a noob and missed something in the docs).
Got it, thank you for clarifying! So I see two ways to solve my case:
Create a simple external bridge app that runs in the background and loads only the healthy instances' IPs into Caddy via its admin API.
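A rough sketch of what the core of such a bridge could look like. It only shows the translation step: turning a Consul health API response (`GET /v1/health/service/<name>?passing`) into the `upstreams` array that Caddy's reverse_proxy handler accepts via the admin API. The service name, addresses, and ports are made up for illustration, and the actual HTTP calls to Consul and to Caddy's `/config/...` endpoint are left out:

```python
import json

def healthy_upstreams(consul_health_json):
    """Extract "host:port" dial addresses from a Consul health response.

    With the ?passing filter, Consul only returns instances whose health
    checks pass, so every entry here is considered healthy.
    """
    upstreams = []
    for entry in consul_health_json:
        svc = entry["Service"]
        upstreams.append({"dial": f'{svc["Address"]}:{svc["Port"]}'})
    return upstreams

def caddy_upstreams_payload(consul_health_json):
    """JSON body to write into the handler's "upstreams" field via the admin API."""
    return json.dumps(healthy_upstreams(consul_health_json))

# Example shape of a (trimmed) Consul health response:
sample = [
    {"Service": {"Address": "10.0.1.5", "Port": 8080}},
    {"Service": {"Address": "10.0.2.7", "Port": 8080}},
]
print(caddy_upstreams_payload(sample))
# → [{"dial": "10.0.1.5:8080"}, {"dial": "10.0.2.7:8080"}]
```

The bridge would poll Consul (or use a blocking query) and push the payload to Caddy whenever the healthy set changes.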
Implement DNS SRV tryAgain logic in Caddy (as you said).
The second looks more straightforward, I think. I will try to hack on this for some time and will create a PR if I have any success with the implementation!
UPD: found a much simpler solution! Consul has prepared queries (see Automate Geo-Failover with Prepared Queries | Consul - HashiCorp Learn), and you can use service-name.query.consul instead of service-name.service.consul. This way, the SRV record will always resolve as long as at least one instance in any datacentre is healthy, so we do not need to implement this logic directly in Caddy!
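For anyone finding this later, a sketch of a prepared query definition for this case. It is registered with a POST to Consul's `/v1/query` endpoint; the query and datacenter names are illustrative:

```json
{
  "Name": "explorer-dashboard",
  "Service": {
    "Service": "explorer-dashboard",
    "Failover": {
      "Datacenters": ["dc2"]
    }
  }
}
```

After registering it (e.g. `curl -X POST http://127.0.0.1:8500/v1/query -d @query.json` against the local agent), DNS queries for explorer-dashboard.query.consul return healthy instances from the local DC, and fall over to dc2 only when none are healthy locally, so Caddy's SRV lookup keeps succeeding.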