High availability mode for load balancing?

A general query at this stage. Caddy trivialises load balancing, e.g.

office.mydomain.com {
  reverse_proxy 10.1.1.13:8880 10.1.1.12:9880
}

I’ve become aware, though, that this doesn’t necessarily mean the upstream services actually support this mode of operation. It appears that the office application I’m trying to load balance supports collaborative editing within an upstream instance, but not across upstream instances.

Is it possible to set up load balancing so it operates more in a high availability mode rather than the more traditional load balancing mode? What I mean by this is that one upstream will always be used unless it fails, in which case, there’s a switch to the next upstream in line. This ensures the high availability of the upstream service while resolving the collaborative editing issue that would otherwise arise through load balancing in the usual sense.

I had a look through lb_policy under Load balancing in the documentation, but nothing immediately jumped out at me. Any ideas?

Yes, use lb_policy first. The default is lb_policy random.

The docs should be pretty self-explanatory.
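
For example, a minimal sketch based on the config from your first post (not tested against your setup):

office.mydomain.com {
  reverse_proxy 10.1.1.13:8880 10.1.1.12:9880 {
    lb_policy first
  }
}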

I’ve just tried this. I shut down the server offering the primary upstream service and expected the secondary upstream service to kick in, but it didn’t appear to. :cry: Nextcloud responds as shown below.

For a visual representation of the arrangement, please refer to this TrueNAS community thread entry: Nextcloud and OnlyOffice Integration, post #45.

Relevant Caddyfile excerpts (note: I’ve tried with and without lb_try_duration):

...
(proxy-host2) {
  @{args.0} host {args.0}.udance.com.au
  reverse_proxy @{args.0} {args.1} {args.2} {
    lb_policy first
    lb_try_duration 250ms
  }
}
...
*.udance.com.au {
...
    import proxy-host2  office          10.1.1.12:9880 10.1.1.13:8880
...
}

The Nextcloud-OnlyOffice connector is unaware that the OnlyOffice address refers to multiple upstream services.

You also need to enable active or passive health checks (or both) for Caddy to recognize upstreams as down.

It should be enough to add fail_duration to tell Caddy how long to remember failed connections to an upstream (basically, each failure increments a counter, and a timer decrements it again after the given duration).

(I agree that the docs could more clearly explain this aspect of it)
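
As a rough sketch based on the first example in this thread (the fail_duration value here is just illustrative, tune it for your setup):

office.mydomain.com {
  reverse_proxy 10.1.1.13:8880 10.1.1.12:9880 {
    lb_policy first
    lb_try_duration 250ms
    fail_duration 30s
  }
}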

I’ve tried a bunch of stuff (without really understanding and being confident about what I’m doing) and I still haven’t been able to invoke the secondary service…

(proxy-host2) {
  @{args.0} host {args.0}.udance.com.au
  reverse_proxy @{args.0} {args.1} {args.2} {
    lb_policy first
    lb_try_duration 250ms
    fail_duration 2h
 #   health_uri 10.1.1.12
 #   health_interval 5s
 #   health_timeout 250ms
  }
}

I feel I could do with some additional guidance on this aspect.

Works for me with a config like this:

{
	debug
}

:7000 {
	reverse_proxy :7001 :7002 {
		lb_policy first
		lb_try_duration 5s
		fail_duration 30s
	}
}

# :7001 {
# 	respond "7001"
# }

:7002 {
	respond "7002"
}

You can play around with this by running it like this:

$ caddy run --watch

And making requests like this, watching for the response (either 7001 or 7002, depending on the backend hit):

$ curl localhost:7000

And then comment in/out the :7001 block to take down the primary etc.

What I saw from testing is that, on my system, lb_try_duration had to be higher than 2s, because it took 2 seconds for the dialer to error out with dial tcp :7001: connectex: No connection could be made because the target machine actively refused it. So if the try duration was less than 2 seconds, it wouldn’t attempt a retry.

This might be different on your system, I’m not sure. But just look at your logs to see how long it takes for the errors to come back when trying to connect, then make lb_try_duration at least longer than that.

Edit: I noticed in the Caddy code that the default DialTimeout is set to 10s, so you could set this to something lower, like transport http { dial_timeout 2s } (but with newlines, obviously).
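
Written out with newlines, that would look roughly like this (a sketch reusing the test config from above):

:7000 {
	reverse_proxy :7001 :7002 {
		lb_policy first
		lb_try_duration 5s
		fail_duration 30s

		transport http {
			dial_timeout 2s
		}
	}
}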

Setting it to 5s, I see in my debug logs that the dial timeout triggered after "duration": 2.0156536 (seconds), then another roundtrip was made 250ms later (the default lb_try_interval) against the secondary backend, and its response was returned.

Also, fail_duration is how long each failed attempt is remembered, so 2h is much too long. A value like 30s means that after the first failure, Caddy will stop trying to connect to the primary for the next 30 seconds after triggering the fallback, then forget about the failure and try the primary again. This does mean that one request every 30 seconds might get a small hiccup for as long as your primary is down, but otherwise it would take an entire 2 hours for Caddy to realize that your primary is up again when only using passive health checks.

Seeing your commented-out health_uri: that’s incorrect – it should be a request path (plus an optional query if you need it) to use against the listed upstreams. So something like /health, if you have some endpoint on your upstream that returns a 200 status quickly. A health endpoint usually just checks that the app can connect to its database or similar – it depends on what the app considers healthy, but that’s usually a good place to start. If it’s a static file server, then any page that returns status 200 would do.
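
Applied to your proxy-host2 snippet, active health checks might look something like this (untested, and /healthcheck is only a guess at a path; use whatever endpoint your upstream actually answers with a fast 200):

(proxy-host2) {
  @{args.0} host {args.0}.udance.com.au
  reverse_proxy @{args.0} {args.1} {args.2} {
    lb_policy first
    lb_try_duration 5s
    fail_duration 30s
    health_uri /healthcheck
    health_interval 5s
    health_timeout 2s
  }
}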

Wow! It’s a lot more involved than I thought. I’ll run some tests now and report back later. Thanks for the detailed guide. I notice debug is your friend; I should have thought of that.

Your response made me realise that there could still be a problem that cannot be solved unless Caddy is able to do time math. Let me explain further. From the OP, the issue I’m trying to address is that collaborative editing is possible within the upstream service, but not across upstream services. So, document collaboration isn’t possible if one user is editing a doc from one upstream instance while another user is editing the same doc from another upstream instance.

In an ideal world, the fail_duration should be something like ‘the difference between midnight and now’ so basically what I’m saying is ‘stay on the secondary service until midnight when it’s unlikely that anyone will be collaborating’. That should help explain the long duration. Unfortunately, this won’t help anyway unless Caddy can do time math. Thoughts?

You’d be better off scripting your tooling to not reboot your “primary” until midnight, then. Or write a custom lb_policy module which can handle this for you. It’s honestly a pretty strange requirement.

Sounds like a pretty poorly designed app if they didn’t take into consideration horizontal scaling for the collaborative stuff.

As a developer, how I generally solve scalability for real-time stuff is by using Redis as a backend to allow each instance to push data to each other via pub/sub. I’ve written websocket servers which do this – if a message comes in from one user but that user is not connected to the same instance, it publishes a message via Redis to propagate it to any other instances which might be running and are connected/subscribed to Redis.

If it’s not designed with this in mind from the start, it’s usually pretty hard to add that kind of scalability functionality after the fact since it involves some pretty fundamental changes in how message passing is done.

Yeah, what I’m hearing here is that you want an automatic failover (i.e. to second instance) but a delayed failback (the point where all connections are moved back to the primary).

HAProxy’s whole schtick is high availability; I wonder if they have some easy config that can achieve this kind of failback specification? You might be able to insert it between Caddy and the upstreams as a cheap alternative to writing a new policy module, who knows.

Not practical, as OO is just one of many services offered by the downed server.

There are two popular office suites for Nextcloud - OnlyOffice and Collabora CODE.

OO appears to be the newer kid on the block, so it’s less mature. However, it is more compatible with MS Office documents because it uses the same Open XML document format. I scoured the OO documentation but couldn’t find any info on scalability. I have a thread open on the OO community forum seeking clarification, but I’m not holding my breath.

CODE, on the other hand, does support scalability and, with some additional configuration, is able to take advantage of the Caddy lb defaults. The disadvantage, though, is that Collabora is tuned for ODF.

No matter, I’ll just let the TrueNAS community know that, for the moment, lb and document collaboration are mutually exclusive with OO. This is a limitation of OO, not a Caddy issue.

Thanks, Matthew, I’ll add it to my review list :grinning: For the moment, I’ll add a footnote to any lb and OO communication I provide to the TN community. In the meantime, hopefully something positive around scalability comes back via the OO forum.

EDIT: Interesting. HAProxy is mentioned in CODE scalability.

Ha! I’ve just stumbled across this recent OO blog post ONLYOFFICE App Server: microservice architecture for scalability and clustering. So, it appears to be on the cards, but hasn’t been delivered yet.

It’s interesting how a seemingly innocent question about lb has sent me down the rabbit hole :grinning_face_with_smiling_eyes:

This test rig is so cool. It’s given me a much deeper insight into, and feel for, how Caddy health checks work. I assume health checks are desirable irrespective of the lb_policy used.

Thanks for the clarification. I wasn’t sure before.

Yeah. They’re necessary for first to work at all, but not required for ones like random or round_robin. Health checks allow Caddy to more efficiently choose upstreams by skipping ones that are known to be unhealthy instead of trying to connect to them.
