Works for me with a config like this:
{
    debug
}

:7000 {
    reverse_proxy :7001 :7002 {
        lb_policy first
        lb_try_duration 5s
        fail_duration 30s
    }
}

# :7001 {
#     respond "7001"
# }

:7002 {
    respond "7002"
}
You can play around with this by running it like this:
$ caddy run --watch
And making requests like this, watching for the response (either 7001 or 7002, depending on which backend was hit):
$ curl localhost:7000
And then comment the :7001 block in/out to take down the primary, etc.
What I saw from testing is that on my system, lb_try_duration had to be higher than 2s, because it took 2 seconds for the dialer to error out with dial tcp :7001: connectex: No connection could be made because the target machine actively refused it. So if the try duration was less than 2 seconds, it wouldn’t attempt a retry at all.
This might be different on your system, I’m not sure. But just look at your logs to see how long it takes for the errors to come back when trying to connect, then make lb_try_duration at least a bit longer than that.
Edit: I noticed in the Caddy code that the default DialTimeout is set to 10s, so you could set this to something lower (like transport http { dial_timeout 2s }, but with newlines, obviously).
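Written out with those newlines, inside the proxy block from above, it would look something like this (the 2s is just an example value):

:7000 {
    reverse_proxy :7001 :7002 {
        lb_policy first
        lb_try_duration 5s
        fail_duration 30s

        # Fail dials to a down backend faster than the 10s default
        transport http {
            dial_timeout 2s
        }
    }
}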
Setting it to 5s, I see in my debug logs that the dial timeout triggered after "duration": 2.0156536 (seconds), then another roundtrip happened 250ms later (the default lb_try_interval) against the secondary backend, which returned its response.
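Relatedly, if you want Caddy to retry more or less often while it waits out lb_try_duration, that interval is configurable too. A minimal sketch (the 500ms is just an assumption, tune it to taste):

:7000 {
    reverse_proxy :7001 :7002 {
        lb_policy first
        lb_try_duration 5s
        # How long to wait between retries within lb_try_duration;
        # defaults to 250ms
        lb_try_interval 500ms
    }
}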
Also, fail_duration is how long to remember each failed attempt, so 2h is much too long. Using a value like 30s means that after the first failure, Caddy will stop trying to connect to the primary for the next 30 seconds after triggering the fallback, then forget about the failure and try the primary again. This does mean that one request every 30 seconds might get a small hiccup for as long as your primary is down, but otherwise it would take an entire 2 hours for Caddy to realize that your primary is up again when only using passive health checks.
Seeing your commented-out health_uri, that’s incorrect – that should be a request path (plus an optional query string if you need it) to use against the listed upstreams. So something like /health, maybe, if your upstream has some endpoint that returns a 200 status quickly. A health endpoint usually just entails checking that the app can connect to its database or the like – it depends on what the app considers healthy, but that’s usually a good place to start. If it’s a static file server, then any page that returns status 200 would do.
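For example, active health checks against that kind of endpoint could look something like this (the /health path and the 10s interval are assumptions, adjust them to whatever your upstream actually exposes):

:7000 {
    reverse_proxy :7001 :7002 {
        lb_policy first
        # Actively probe each upstream on a timer instead of only
        # marking it unhealthy after failed proxy requests
        health_uri /health
        health_interval 10s
    }
}

That way Caddy notices on its own when the primary comes back, without a live request having to eat the failure first.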