Reverse-proxy load-balancing using Consul DNS

alexeirbv · May 25, 2021, 8:16pm

1. Caddy version (`caddy version`):

v2.4.1 h1:kAJ0JB5Xk5gPdTH/27S5cyoMGqD5lBAe9yZ8zTjVJa0=

2. How I run Caddy:

Running caddy run in my terminal emulator

a. System environment:

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.2 LTS
Release:	20.04
Codename:	focal

Linux: 5.4.0-70-generic

b. Command:

caddy run

c. Service/unit/compose file:

Not using Docker/systemd/Kubernetes/make etc

d. My complete Caddyfile or JSON config:

{
	acme_dns cloudflare <cloudflare_token>

	storage consul {
		address "127.0.0.1:8500"
		token " <consul_token>"
		timeout 10
		prefix "caddytls"
		value_prefix "caddysite"
		aes_key "<aes_key>"
		tls_enabled "false"
		tls_insecure "true"
	}
}

<my-domain> {
	reverse_proxy {
		to srv+http://explorer-dashboard.service.dc1.consul srv+http://explorer-dashboard.service.dc2.consul
		lb_try_duration 2s
		lb_policy round_robin
		fail_duration 5s
		max_fails 2
		unhealthy_status 5xx
		unhealthy_request_count 2
	}
}

3. The problem I’m having:

I am trying to achieve service discovery and failover for my app using Caddy and Consul DNS. I have two Consul datacenters and two instances of my app (one per DC). Each service instance has a Consul health check, so if the check fails, Consul excludes the failed instance from its DNS responses (this is by design in Consul). For example, if explorer-dashboard in dc1 fails, a DNS query for explorer-dashboard.service.dc1.consul returns nothing. I expected Caddy to handle this.
I mean its behaviour would be something like: "Resolving explorer-dashboard.dc1: OK. Resolving explorer-dashboard.dc2: FAILED. Okay, the only good upstream for proxying is explorer-dashboard.dc1. I will route all traffic to it, and once explorer-dashboard.dc2 comes back up I will resume load balancing." But Caddy doesn’t work as I expect. It returns HTTP 502 on every second request because one instance of my app is down, e.g.
the first request, curl https://my-domain.com, returns 200
the second request, curl https://my-domain.com, returns 502
It tries to resolve the DNS SRV record of the failed instance and receives an error from the DNS resolver.

How can I exclude instances which cannot be resolved from traffic routing?

4. Error messages and/or full log output:

ERROR	http.log.error	making dial info: lookup explorer-dashboard.service.dc1.consul on 127.0.0.53:53: no such host

5. What I already tried:

I tried to tune the reverse_proxy health checks, but it seems the problem is not in the health checks (or I am a noob and missed something in the docs).

6. Links to relevant resources:

–

francislavoie · May 25, 2021, 8:53pm

I think the trouble is that SRV support in Caddy wasn’t implemented with this type of use case in mind (because nobody really voiced this type of use case when it was implemented).

If you follow the code, you can pretty clearly see what’s going on.

github.com

caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/reverseproxy.go#L406

    
      
	// dialer will behave. See #4237 for context.
	origURLScheme := r.URL.Scheme
	origURLHost := r.URL.Host
	r.URL.Scheme = ""
	r.URL.Host = ""

	// restore modifications to the request after we're done proxying
	defer func() {
		r.Host = reqHost     // TODO: data race, see #4038
		r.Header = reqHeader // TODO: data race, see #4038
		r.URL.Scheme = origURLScheme
		r.URL.Host = origURLHost
	}()

	start := time.Now()
	defer func() {
		// total proxying duration, including time spent on LB and retries
		repl.Set("http.reverse_proxy.duration", time.Since(start))
	}()

	var proxyErr error

Look for “making dial info”, that’s the case you’re hitting. Caddy tries to resolve the SRV address to an upstream, but the lookup fails, so it gives up right then and returns an error.

Maybe in the case of SRV, it shouldn’t return an error, but instead do a tryAgain to select a different upstream.
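To make the idea concrete, here is a rough, standalone sketch of what "try a different upstream when the SRV lookup fails" could look like. It is not Caddy's actual code: `pickUpstream`, `resolveFunc`, `resolveSRV`, and `errNoRecords` are hypothetical names invented for this illustration, and the real change would have to live inside the reverse proxy's retry loop.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
)

// resolveFunc turns an SRV name into host:port addresses. It is injectable
// so the selection logic can be exercised without a real DNS server.
// (Hypothetical type, not part of Caddy.)
type resolveFunc func(srvName string) ([]string, error)

// errNoRecords is returned when a lookup succeeds but yields no targets.
var errNoRecords = errors.New("no SRV records")

// resolveSRV performs a real SRV lookup. Empty service and proto arguments
// make net.LookupSRV query the name exactly as given.
func resolveSRV(srvName string) ([]string, error) {
	_, records, err := net.LookupSRV("", "", srvName)
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, errNoRecords
	}
	addrs := make([]string, 0, len(records))
	for _, rec := range records {
		addrs = append(addrs, net.JoinHostPort(rec.Target, strconv.Itoa(int(rec.Port))))
	}
	return addrs, nil
}

// pickUpstream walks the upstream list round-robin, beginning at index start
// (assumed non-negative), and returns the first address of the first name
// that resolves. A failed lookup moves on to the next upstream instead of
// aborting the request.
func pickUpstream(names []string, start int, resolve resolveFunc) (string, error) {
	lastErr := errors.New("no upstreams configured")
	for i := 0; i < len(names); i++ {
		name := names[(start+i)%len(names)]
		addrs, err := resolve(name)
		if err != nil {
			lastErr = fmt.Errorf("making dial info for %s: %w", name, err)
			continue
		}
		if len(addrs) == 0 {
			lastErr = fmt.Errorf("making dial info for %s: %w", name, errNoRecords)
			continue
		}
		return addrs[0], nil
	}
	return "", lastErr
}

func main() {
	names := []string{
		"explorer-dashboard.service.dc1.consul",
		"explorer-dashboard.service.dc2.consul",
	}
	addr, err := pickUpstream(names, 0, resolveSRV)
	fmt.Println(addr, err)
}
```

With a policy like this, a request that lands on the unresolvable dc1 name would fall through to dc2 rather than producing a 502, and an error is returned only when none of the names resolve.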

It’s clear there’s some refactoring to do with SRV support, see this issue which is somewhat related:

alexeirbv · May 25, 2021, 9:16pm

Got it, thank you for clarifying! So, I see two ways to solve my case:

Create a simple external bridge app that runs in the background and loads only the IPs of healthy instances into Caddy via its API.
Implement the DNS SRV tryAgain logic in Caddy (as you said).

The second looks more straightforward, I think. I will try to hack on this for some time and will create a PR if I have any success with the implementation!
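For reference, the first option (the external bridge app) could be sketched roughly as below. This is only an illustration under assumptions: the helper names `upstreamsJSON` and `pushUpstreams` are made up, the list of healthy addresses is assumed to come from Consul by some other means, and the exact admin API path to the `upstreams` array depends on how the running config is structured, so the URL passed to `pushUpstreams` is left to the caller.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// upstream mirrors the shape of one reverse_proxy upstream in Caddy's JSON
// config, where the backend address lives in the "dial" field.
type upstream struct {
	Dial string `json:"dial"`
}

// upstreamsJSON renders the healthy addresses as a JSON upstream list.
func upstreamsJSON(addrs []string) string {
	ups := make([]upstream, 0, len(addrs))
	for _, a := range addrs {
		ups = append(ups, upstream{Dial: a})
	}
	b, _ := json.Marshal(ups) // cannot fail for this plain struct
	return string(b)
}

// pushUpstreams replaces the upstream list at configURL with body using a
// PATCH request to Caddy's admin API. configURL is an assumption: it must
// point at the upstreams array of the relevant reverse_proxy handler under
// /config/ on the admin endpoint (localhost:2019 by default).
func pushUpstreams(configURL, body string) error {
	req, err := http.NewRequest(http.MethodPatch, configURL, strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("caddy admin API returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Example addresses only; a real bridge would refresh these periodically
	// from Consul's health data and call pushUpstreams when they change.
	fmt.Println(upstreamsJSON([]string{"10.0.0.1:8080", "10.0.0.2:8080"}))
}
```

The downside of this approach is the extra moving part: the bridge has to poll or watch Consul and keep Caddy's config in sync.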

UPD: Found a much simpler solution! Consul has prepared queries (see - Automate Geo-Failover with Prepared Queries | Consul - HashiCorp Learn), so you can use service-name.query.consul instead of service-name.service.consul. This way, the SRV record always resolves as long as at least one instance in any datacenter is healthy, so we do not need to implement this logic directly in Caddy!
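As a rough illustration of the prepared-query approach, the sketch below builds the JSON definition that would be sent to Consul's `POST /v1/query` endpoint. The query name, the failover datacenter list, and the helper name `preparedQueryJSON` are assumptions chosen to match this thread; adjust them to your own setup.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// failover lists the datacenters Consul should try, in order, when the
// local datacenter has no healthy instances.
type failover struct {
	Datacenters []string
}

// serviceSpec selects the service behind the query. OnlyPassing restricts
// results to instances whose health checks are passing.
type serviceSpec struct {
	Service     string
	OnlyPassing bool
	Failover    failover
}

// preparedQuery is a minimal subset of Consul's prepared query definition.
type preparedQuery struct {
	Name    string
	Service serviceSpec
}

// preparedQueryJSON builds a definition named after the service, so it can
// be looked up over DNS as <name>.query.consul.
func preparedQueryJSON(name string, failoverDCs []string) string {
	q := preparedQuery{
		Name: name,
		Service: serviceSpec{
			Service:     name,
			OnlyPassing: true,
			Failover:    failover{Datacenters: failoverDCs},
		},
	}
	b, _ := json.Marshal(q) // cannot fail for these plain structs
	return string(b)
}

func main() {
	// Registered in dc1 with failover to dc2 (datacenter names from this thread).
	fmt.Println(preparedQueryJSON("explorer-dashboard", []string{"dc2"}))
}
```

After the query is registered, the upstream in the Caddyfile would become something like srv+http://explorer-dashboard.query.consul, and Consul takes care of answering with instances from another datacenter when the local one has none that are healthy.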

system · June 24, 2021, 8:17pm

This topic was automatically closed after 30 days. New replies are no longer allowed.