Optimal lb_policy settings for production

Hi,
I am just getting around to setting up Caddy v2 as a load balancer for our prod environment, and I am curious what the optimal values should be. Currently I have this setup:

*.example.com {
    tls {
        on_demand
    }
    reverse_proxy {
        transport http {
            dial_timeout 600ms
        }
        to 10.14.0.6 10.14.0.5:8080
        lb_policy round_robin
        lb_try_duration 3s
        lb_try_interval 1s
        fail_duration 20s
    }
}

I got those settings off of an example I found earlier. I am not even sure whether some of them are needed at all, for instance dial_timeout, lb_try_duration, lb_try_interval, and fail_duration.

If those are indeed needed, what would be the best values for optimal performance?

Make sure to configure an ask endpoint. Using on_demand is dangerous otherwise.
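For example (a minimal sketch; the localhost port and /check path are placeholders for whatever service you use to approve domains), the ask URL goes in the global options block and should return HTTP 200 only for domains you're willing to issue certificates for:

{
    on_demand_tls {
        ask http://localhost:5555/check
    }
}

*.example.com {
    tls {
        on_demand
    }
    # ... reverse_proxy as above
}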

There’s no silver bullet; it entirely depends on your setup and performance characteristics.

But that seems fine.

With those numbers, if dialing fails it will retry at most once more: dial for up to 600ms, wait the 1s interval, make a retry attempt at ~1.6s, dial for up to another 600ms, wait 1s again, and by ~3.2s you’re outside the 3s lb_try_duration window. So you may want to tweak the numbers if you want to retry more than once.
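For instance (a sketch only; the numbers are illustrative, not recommendations), widening lb_try_duration and shortening lb_try_interval leaves room for several attempts:

reverse_proxy {
    transport http {
        dial_timeout 600ms
    }
    to 10.14.0.6 10.14.0.5:8080
    lb_policy round_robin
    lb_try_duration 5s
    lb_try_interval 250ms
    fail_duration 20s
}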

The fail_duration option turns on passive health checks. Setting it to 20s means that if one of the backends fails to connect (for example, the dial timeout is reached), it’ll be marked unhealthy and won’t be retried for 20s. I think 20s is pretty long, because if all your backends go down, it’ll take up to 20s for them to become available again after being fixed. You can adjust the numbers if you run into problems.

Also, you may want to configure max_fails if you don’t want a backend taken out for the full 20s the moment it has a single dial problem; it could be tried a few more times before it’s actually removed from rotation, as sketched below.
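Something like this (the max_fails value is just illustrative) keeps an upstream in rotation until it has failed a few times within the fail_duration window:

reverse_proxy {
    to 10.14.0.6 10.14.0.5:8080
    lb_policy round_robin
    fail_duration 20s
    max_fails 3
}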

You might also want to turn on active health checks, which may notice earlier that a backend has a problem. The benefits depend on your traffic level, though: if you have pretty low or infrequent traffic, periodic active health checks give you a baseline of probes, so problems are noticed even when no real requests are coming in.
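A sketch of active health checks, assuming your backends expose something like a /health endpoint that returns 200 when they’re up (the path and intervals are placeholders):

reverse_proxy {
    to 10.14.0.6 10.14.0.5:8080
    lb_policy round_robin
    health_uri /health
    health_interval 10s
    health_timeout 2s
}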

But yeah, essentially this seems fine.

@francislavoie ma man, thank you for reaching out. The docs say max_fails is the number of failed requests within fail_timeout, but I can’t seem to find anything on fail_timeout. Is that the same as fail_duration?

Ah, yeah I think that was meant to refer to fail_duration. Mistake in the docs.

Fixed; it will be in the next push to the website.

Hey, I am just curious… Does this setting ip_hash mean that the server assignment is done in a round-robin fashion, but once the user gets a server assigned he/she is pretty much served from that server only?

No, ip_hash doesn’t do any round-robin. It just picks an upstream directly from a hash of the client’s IP address.
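A minimal sketch, reusing the upstreams from the config earlier in the thread:

reverse_proxy {
    to 10.14.0.6 10.14.0.5:8080
    lb_policy ip_hash
}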

Is there a way to have a round robin assignment that afterwards turns into a sticky session for the user?

The cookie one is likely what you’d want.

The key is that something needs to be used to “remember” which client it was. The easiest way to do that is with a cookie, making the client store which upstream it connected to last.

If Caddy had to store this information for each client that connected, that would be expensive in memory, and it would likely need to be capped at storage for only a limited amount of clients.

Gotcha. Do you have an example of how to use the cookie? Is it just doing lb_policy: cookie?

The docs show the syntax: lb_policy cookie [<name> [<secret>]]. It takes an optional name (the [ ] mean optional) for the cookie that will be written back to the client, and an optional secret used to HMAC-hash the cookie value. The name defaults to lb.
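For example (the cookie name and secret below are placeholders, not values from this thread):

reverse_proxy {
    to 10.14.0.6 10.14.0.5:8080
    lb_policy cookie sticky_lb some-long-random-secret
}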

How is the server picked from the pool?

Random at first; the name of the chosen upstream is hashed, set as a cookie, and the request is proxied. If the cookie is present on a later request, Caddy loops through each upstream, hashes each one, and compares to find a match. If a match is found, the request goes there; if no match, it picks randomly again.

Sorry, last question… is there any impact on performance choosing cookie vs round_robin?

Negligible. (Technically, round_robin requires less computation.)

@matt @francislavoie I have updated our prod settings to use the cookie setting, and it is working well! I also noticed there is another sticky session setting, namely ip_hash. There is not much documentation on that, so please let me know if I am understanding it correctly.

cookie is set in the client’s browser. If the client happens to delete the cookie, then another server might be assigned to the client on a new page load.

ip_hash is based on the client’s IP. So even if the client deletes all their cookies, that client is still kept on the same server (unless their IP changes).

Is that the gist?

That’s right. It’s done mathematically, though (no memory). If an upstream server that would normally be picked goes down, it’ll fall back to the other. If that server comes back online, it won’t stay on the fallback; it’ll revert to the first one. If you add additional upstreams, there’s no guarantee of which server will be used.
