1. The problem I’m having:
I am using Caddy as an SSL terminator and load balancer in front of Docker (with Nomad and Consul).
The containers are stateless and the connections don’t require sticky sessions.
Yet, on deploy we are seeing 5xx errors as proxy upstreams get replaced with new instances.
We would like users not to notice the deploy at all, so we want to change the config accordingly.
2. Error messages and/or full log output:
When deploying, we see 5xx status codes for a while before responses return to 2xx:
HTTP/2 200
HTTP/2 503
...
HTTP/2 503
HTTP/2 502
...
HTTP/2 502
HTTP/2 200
3. Caddy version:
caddy:2.6.2-alpine
4. How I installed and ran Caddy:
Caddy is running inside docker.
a. System environment:
Debian Linux, x86, Docker
b. Command:
NA
c. Service/unit/compose file:
NA
d. My complete Caddy config:
Before:
dev.xolinoid.com {
    header /api* cache-control "no-cache"
    reverse_proxy /api* {
        dynamic srv {
            name backend-staging.service.datacenter1.consul
            refresh 60s
            dial_timeout 1s
            dial_fallback_delay -1s
        }
    }
}
After:
dev.xolinoid.com {
    header /api* cache-control "no-cache"
    reverse_proxy /api* {
        dynamic srv {
            name backend-staging.service.datacenter1.consul
            refresh 5s
            dial_timeout 1s
            dial_fallback_delay -1s
        }
        lb_try_duration 2s
        fail_duration 2s
        health_uri /api/ping
        health_interval 10s
        health_timeout 2s
    }
}
5. Links to relevant resources:
These seem quite old:
The question
It feels like there are a lot of durations and timeouts to keep aligned, and I am a little lost.
name backend-staging.service.datacenter1.consul
refresh 5s
dial_timeout 1s
dial_fallback_delay -1s
This will refresh the dynamic service lookup every 5s, and the SRV query itself presumably only gets 1s. But I cannot remember or understand why dial_fallback_delay was set to -1s when the default is 300ms. This also means it can take up to 5s for a change in upstreams to be noticed.
We are currently doing a rolling deploy with 10s between instance deployments. With a rollout delay of 10s, maybe the margin should be increased so the new instance is found for sure before the rollout continues. Maybe "refresh 3s"?
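Put together, this is my current reading of the dynamic upstream block, as an annotated sketch (the comments are my own assumptions, not anything confirmed):

reverse_proxy /api* {
    dynamic srv {
        name backend-staging.service.datacenter1.consul
        refresh 3s               # re-query Consul every 3s, comfortably inside the 10s rollout delay
        dial_timeout 1s          # give the SRV lookup at most 1s
        dial_fallback_delay -1s  # a negative value disables the 300ms Happy Eyeballs fallback; I assume that was the intent
    }
}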
lb_try_duration 5s
fail_duration 10s
health_uri /api/ping
health_interval 10s
health_timeout 2s
With lb_try_duration 5s the proxy will keep retrying for up to 5s to find an available upstream before it gives up on the request. Given there could be connection timeouts, might it be better to keep this shorter?
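If I read the docs right, this works together with lb_try_interval (250ms by default), which sets the wait between retries. Roughly what I am picturing (the lb_try_interval line is only there to make the timing explicit, we don't set it today):

reverse_proxy /api* {
    lb_try_duration 2s     # total time to keep retrying upstream selection for one request
    lb_try_interval 250ms  # pause between retries (the documented default)
}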
With fail_duration 10s a backend will be marked unreachable and it takes 10s before it is tried again. This seems to be in line with a 10s rolling deploy. But shouldn't this also be as short as possible, basically as long as a service startup would take?
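For context, this is how I picture the passive health checking knobs fitting together; the extra directives are my assumptions about the defaults, not something we run today:

reverse_proxy /api* {
    # passive health checks, driven by real client traffic
    fail_duration 10s     # remember a failed request for 10s; the upstream is skipped while it counts as down
    max_fails 1           # the default: a single failure within fail_duration marks the upstream as down
    unhealthy_status 5xx  # also treat 5xx responses as failures, not just connection errors
}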
But now I am totally lost as to how this will interact with the active health monitoring, and whether that is even worth using in this case.
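For completeness, the active health check part as I understand it; I am not even sure these probes apply to dynamically discovered (SRV) upstreams, so the comments are assumptions on my part:

reverse_proxy /api* {
    # active health checks: Caddy probes upstreams itself, independent of client traffic
    health_uri /api/ping  # path to probe on every upstream
    health_interval 10s   # probe every 10s, so a dead upstream could go unnoticed for up to 10s
    health_timeout 2s     # count the probe as failed if there is no response within 2s
}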
Are these reasonable numbers at all?
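For reference, this is the complete config I am currently leaning towards, with the reasoning above as comments (all of the values are still guesses on my part):

dev.xolinoid.com {
    header /api* cache-control "no-cache"
    reverse_proxy /api* {
        dynamic srv {
            name backend-staging.service.datacenter1.consul
            refresh 3s       # well under the 10s rollout delay
            dial_timeout 1s
            dial_fallback_delay -1s
        }
        lb_try_duration 2s   # retry upstream selection for up to 2s per request
        fail_duration 10s    # roughly one rollout step; skip a failed upstream for this long
        health_uri /api/ping
        health_interval 10s
        health_timeout 2s
    }
}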
What would you change?
Thanks for the input!