Load balancing problem with docker dnsrr mode

Andrey_Izotov · June 16, 2022, 5:55am

I use Caddy as a load balancer for my Docker Swarm cluster.
I use the dynamic A upstream resolver to get the host IPs.

Here is my configuration:

api.globus.furniture {
    reverse_proxy {
        dynamic a {
            name api-service
            port 80
        }
        lb_policy ip_hash
        health_uri /base/health
        header_up Host {upstream_hostport}
        header_down +X-Used-Endpoint {upstream_hostport}
        header_down -server
    }
    encode zstd gzip
    log
}

The problem comes with service updates: when I update a service in my cluster, I get significant downtime.
Docker makes sure a new container is up and running before shutting down the previous one, so I think the problem is not with the Docker setup.

If, before the update, my node IPs (resolved from the DNS A record) were:
10.0.0.1
10.0.0.2
10.0.0.3
After the update, all of the IPs change.
So, for example, I will get:
10.0.0.4
10.0.0.5
10.0.0.6

By default, Caddy only refreshes DNS records once a minute. So depending on luck, I get up to a minute of downtime.

The solution I see is to refresh DNS every time an upstream health check fails, or maybe only when all upstream health checks fail. That would guarantee little to no downtime. But I have no idea how to make this possible with the current configuration.

francislavoie · June 16, 2022, 1:22pm

You can change the refresh interval in the config; try 2s or even 0. Since DNS is resolved by a machine close by, it shouldn’t be slow or harmful to make a DNS query on every request. But if you want some caching, you can turn on retries with lb_try_duration and set it to something like 5s, or at least longer than the DNS refresh interval, to guarantee that DNS is refreshed before the next attempt.
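As a rough sketch, here is how that suggestion could be applied to the reverse_proxy block from the first post. The values 2s and 5s are only the example numbers mentioned above, not tuned recommendations, and the other directives from the original config are left out for brevity:

```
api.globus.furniture {
    reverse_proxy {
        dynamic a {
            name api-service
            port 80
            # Re-resolve the A records every 2s instead of the 1m default
            # (example value; 0 would look up DNS on every request)
            refresh 2s
        }
        lb_policy ip_hash
        # Keep retrying a failed request for up to 5s, which is longer
        # than the refresh interval, so a retry can see the new IPs
        lb_try_duration 5s
    }
}
```

The point of the pairing is that lb_try_duration is longer than refresh: a request that hits a stale IP should still be retrying when the next DNS lookup happens.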

Your idea of refreshing when a connection fails is a good one. I’m not sure how easy it would be to implement, though, because it would require you to configure retries, and the dynamic upstream provider would need to know that the next attempt is a retry. Right now there’s no distinction.

Andrey_Izotov · June 17, 2022, 5:07am

Thanks, I will try your solution first.

system · July 16, 2022, 5:55am

This topic was automatically closed after 30 days. New replies are no longer allowed.