Zero-downtime deployments

1. The problem I’m having:

I am using Caddy as an SSL terminator and load balancer in front of Docker (with Nomad and Consul).
The containers are stateless and the connections don’t require sticky sessions.

Yet, on deploy we are seeing 5xx errors as proxy upstreams get replaced with new instances.
We would like users not to notice anything.

So we want to change the config to make that happen.

2. Error messages and/or full log output:

When deploying, we see 5xx status codes that eventually return to 2xx:

HTTP/2 200

HTTP/2 503
...
HTTP/2 503 

HTTP/2 502
...
HTTP/2 502 

HTTP/2 200

3. Caddy version:

caddy:2.6.2-alpine

4. How I installed and ran Caddy:

Caddy is running inside Docker.

a. System environment:

Debian Linux, x86, Docker

b. Command:

NA

c. Service/unit/compose file:

NA

d. My complete Caddy config:

Before:

dev.xolinoid.com {
  header /api* cache-control "no-cache"
  reverse_proxy /api* {
    dynamic srv {
      name backend-staging.service.datacenter1.consul
      refresh 60s
      dial_timeout 1s
      dial_fallback_delay -1s
    }
  }
}

After:

dev.xolinoid.com {
  header /api* cache-control "no-cache"
  reverse_proxy /api* {

    dynamic srv {
      name backend-staging.service.datacenter1.consul
      refresh 5s
      dial_timeout 1s
      dial_fallback_delay -1s
    }

    lb_try_duration 2s
    fail_duration 2s
    health_uri  /api/ping
    health_interval 10s
    health_timeout 2s
  }
}

5. Links to relevant resources:

These seem quite old:

The question

It feels like there are a lot of durations and timeouts to keep aligned, and I am a little lost.

      name backend-staging.service.datacenter1.consul
      refresh 5s
      dial_timeout 1s
      dial_fallback_delay -1s

This will refresh the dynamic service lookup every 5s, and the query itself might only take 1s. What I cannot remember or understand is why dial_fallback_delay was set to -1s when the default is 300ms.

This also means it will take up to 5s for a change in upstreams to be noticed.
We are currently doing a rolling deploy with 10s between instance deployments.
With a rollout delay of 10s, maybe the margin should be increased so the new instance is found for sure before the rollout continues. Maybe refresh 3s?
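As a sketch of that timing relationship (3s is just the value floated here, nothing authoritative):

    dynamic srv {
      name backend-staging.service.datacenter1.consul
      # keep the refresh comfortably below the 10s rollout delay so new
      # instances are discovered before the next one gets replaced
      refresh 3s
      dial_timeout 1s
    }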

    lb_try_duration 5s
    fail_duration 10s
    health_uri  /api/ping
    health_interval 10s
    health_timeout 2s

With lb_try_duration 5s the proxy will try a backend for a max of 5s before moving on to the next upstream. Given that there could be connection timeouts, might it be better to keep this shorter?

With fail_duration 10s a backend will be marked unreachable, and it will take 10s before it is checked again. This seems in line with a 10s rolling deploy. But shouldn't this also be as short as possible? Basically as long as a service startup takes?

But now I am totally lost as to how this interacts with the active health monitoring, and whether that's worth using in this case.

Are these reasonable numbers at all?
What would you change?

Thanks for the input!

You could just set it to -1, no need to have a number of seconds here.

It configures whether Happy Eyeballs is used. By default, the delay is 300ms. A negative value turns off fast fallback.

Not quite. It will try to find an upstream for up to 5s.

This controls how long Caddy will hold onto the request trying to connect to an upstream, until it gives up and returns an error to the client.

A shorter value means Caddy will give up faster, a longer value means it will keep trying longer until it manages to connect to an upstream.
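To make that concrete, here is a hedged sketch of how the retry knobs sit together in the Caddyfile (lb_try_interval is the pause between attempts; the values are only illustrative):

    reverse_proxy /api* {
      # (dynamic srv block omitted for brevity)
      # keep holding the request and retrying upstream selection for up to 5s
      lb_try_duration 5s
      # pause between attempts; 250ms should be the default
      lb_try_interval 250ms
    }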

Retries are separate from health checking. Upstream selection does use the health status to help drive its decision making, and that controls whether an upstream will be tried at all during retries (unhealthy upstreams are skipped).

This is how long Caddy remembers a failure for a particular upstream. When Caddy fails to connect to an upstream, it increments the fail count by one, then starts a 10s timer and at the end of the timer it decrements it by one. The max_fails option controls what count is required for the upstream to be considered unhealthy.

This is part of “passive health checking”.
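Roughly, the passive health checking knobs look like this in the Caddyfile (a sketch; max_fails and unhealthy_status are optional, and the values are illustrative):

    reverse_proxy /api* {
      # (dynamic srv block omitted for brevity)
      # remember each connection failure for 10s
      fail_duration 10s
      # how many remembered failures mark an upstream unhealthy (1 by default)
      max_fails 1
      # optionally also treat these response codes as failures
      unhealthy_status 5xx
    }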

It depends on your needs, as do all these options. This is more about avoiding pressure on upstreams that are down. It depends on the processing cost of requests, your requests-per-second rate, etc.

Active health checks are separate from passive health checks. It also depends on your needs whether it makes sense to enable.

If you have very low traffic, then enabling this will cause some CPU usage on every interval from making the check requests. This might be fine, but if you want near-zero energy costs, it might not be a good thing to have on.

If you have a decently high amount of traffic, then health checks are a tiny drop in the bucket, but do help detect if things are ok before allowing traffic through.

If errors suddenly start happening, active health checks don’t notice until the next interval, but passive health checks could react more quickly to active problems.
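For comparison, an active health check block along the lines of the config above might look like this (a sketch; /api/ping is the endpoint from the original config, and health_status is optional):

    reverse_proxy /api* {
      # (dynamic srv block omitted for brevity)
      # active health checking: probe each upstream on a fixed interval
      health_uri /api/ping
      health_interval 10s
      health_timeout 2s
      # optionally require a specific status code for a passing check
      health_status 200
    }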

I am still a little fuzzy on what that means in practice.
I guess it only makes sense to turn it off for a strict IPv4 setup, and otherwise it would be good to leave it at the default?

So it will pick upstreams according to the lb algorithm, and it won't stop picking until the lb_try_duration timer has run out or an available upstream is found.
Did I understand that correctly?

Even during a rollout the next try should always find a working upstream.
So it sounds like this should be kept as short as possible.

To me it sounds like this should be kept close to half of how long it takes for the upstream to restart.

Does that make sense?

Thanks for your input!

In practice, I don't think you need to worry about it; you can even remove it. It's not even looked at if you didn't configure resolvers, and it only does something if your resolver returned a list of IPs that includes both IPv4 and IPv6 (in which case it will try IPv6 first, then 300ms later try IPv4).
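In other words, the fallback delay would only come into play in a setup that configures its own resolvers on the SRV source, roughly like this (the resolver address is a made-up example; Consul's DNS interface conventionally listens on port 8600):

    dynamic srv {
      name backend-staging.service.datacenter1.consul
      refresh 5s
      # dial_fallback_delay is only consulted when resolvers are set explicitly
      resolvers 10.0.0.2:8600
      dial_timeout 1s
    }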

Yep.

Are you sure the next try will always find a working upstream? How is the rollout performed? Do you boot up the new version before shutting down the old one? Are you sure the new one will actually be ready to accept connections before the old one is shut down?

Using a longer lb_try_duration can help bridge that gap by allowing Caddy more time to find a backend it can connect to.

🤷

You might need to test it out and see how it behaves.

Error during parsing: bad delay value '-1': time: missing unit in duration "-1"

FWIW - that doesn’t seem to be OK.
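Apparently the value needs a unit to parse, so something like this (sketch):

    dynamic srv {
      name backend-staging.service.datacenter1.consul
      # a unit is required, so -1s rather than plain -1
      dial_fallback_delay -1s
    }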

Other than that, it seems like I got a working setup now.
Still a little more testing needed - but looking good.

Thanks for the help!
