Blue / Green Deployment without downtime

1. The problem I’m having:

On a single server, I have Caddy running which serves various sites. This used to be a haproxy setup, which has now been replaced by Caddy. The sites are deployed blue / green style, with a pair of ports assigned to blue / green respectively. The deploy process for the site is roughly:

  1. Check which port is currently active and choose the other port.
  2. Deploy the new version of the site on the other port.
  3. Health check the new version of the site on the other port.
  4. Shut down the old version of the site on the current port.
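
The steps above can be sketched roughly as the shell helpers below. This is only an illustration, not the actual deploy script: the ports and the /healthz/ path are taken from the configs in this post, while other_port, wait_healthy, start_site and stop_site are hypothetical names, and how the active port is detected is left to the caller.

```shell
#!/bin/sh
# Ports taken from the haproxy backend in this post (green=9000, blue=9001).
GREEN_PORT=9000
BLUE_PORT=9001

# Step 1: given the currently active port, print the other one.
other_port() {
  if [ "$1" = "$GREEN_PORT" ]; then
    echo "$BLUE_PORT"
  else
    echo "$GREEN_PORT"
  fi
}

# Step 3: poll the health endpoint once per second until it answers
# with a 2xx/3xx status, or give up after the given number of attempts.
wait_healthy() {
  port=$1
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    if curl -fsS -o /dev/null "http://127.0.0.1:${port}/healthz/"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

# Usage (start_site and stop_site are hypothetical, site-specific commands):
#   new=$(other_port "$active")
#   start_site "$new" && wait_healthy "$new" && stop_site "$active"
```

The old instance is stopped only after the new one passes its health check, so a failed deploy leaves the current version running.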

With haproxy, this config served me well:

backend myweb
  stick-table type ip size 1m
  stick on dst
  mode http
  server green 127.0.0.1:9000 check
  server blue  127.0.0.1:9001 check
  timeout tunnel 10h

With Caddy, I currently have this config which isn’t identical, but mostly works:

reverse_proxy 127.0.0.1:9000-9001 {
	health_uri /healthz/
	lb_try_duration 5s
}

I’m facing 2 issues which I’m struggling to resolve.

  1. The first request after a deploy fails. Caddy tries for 5s, then fails the request. Subsequent requests work.
  2. Since this uses active health checks, Caddy continuously checks the server that is down. I would like it to check only when the current server fails. The failed checks also flood the logs.

2. Error messages and/or full log output:

{"level":"info","ts":1713331751.4350946,"logger":"http.handlers.reverse_proxy.health_checker.active","msg":"HTTP request failed","host":"127.0.0.1:9000","error":"Get \"http://127.0.0.1:9000/healthz/\": dial tcp 127.0.0.1:9000: connect: connection refused"}

3. Caddy version:

v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=

4. How I installed and ran Caddy:

xcaddy build v2.7.6 \
  --with github.com/caddy-dns/cloudflare \
  --with github.com/greenpau/caddy-security

a. System environment:

Arch Linux, caddy running as a systemd service.

b. Command:

/usr/bin/caddy run --config /etc/caddy/Caddyfile

c. My complete Caddy config:

testsite.daaku.org {
	reverse_proxy 127.0.0.1:9000-9001 {
		health_uri /healthz/
		lb_try_duration 5s
	}
}

In typical fashion, I think I have found a working configuration that solves both my problems soon after asking for help. The first request after a deploy works, and health checks are now passive instead of active:

testsite.daaku.org {
	reverse_proxy 127.0.0.1:9000-9001 {
		fail_duration 30s
		lb_policy first
		lb_retries 2
	}
}

Need to verify if 2 deploys within 30s will be an issue, but that seems like an edge case for me.

I think you could get around the 2-deploys issue by setting lb_try_duration to slightly longer than fail_duration, so it gets a chance to retry afterwards. Maybe 5s fail duration with 7s or 8s try duration? Probably don’t need fail duration to be as long as 30s.

So to clarify, you’re suggesting something like this:

testsite.daaku.org {
	reverse_proxy 127.0.0.1:9000-9001 {
		fail_duration 5s
		lb_policy first
		lb_try_duration 8s
	}
}

And if I understand correctly, this would mean when a request fails due to the deploy shutting down the old instance, we’ll keep retrying for 5s? Additionally an instance will only be marked down for 5s after it fails a request?

It’ll keep retrying for 8s, which is longer than the 5s of fail_duration, so it’s likely to get a 2nd try on the same upstream that went down just before the request came in. The first try could have marked the upstream as down; if both upstreams are then down for a while, Caddy would be able to “forget” that the first one failed before it gives up retrying. That’s the theory anyway.

Ah, makes sense about the forgetting the downed instance. One last question, do I still need lb_retries with this setup?

You probably don’t need lb_retries, no. From the docs:

If lb_try_duration is also configured, then retries may stop early if the duration is reached. In other words, the retry duration takes precedence over the retry count.

I think I found another wrinkle, which essentially makes this setup non-functional. I was testing with GET requests, so I did not notice this issue. It does not reproduce with small POST requests either. I believe it is triggered by request bodies larger than 4k (the default read buffer size).

Essentially, when Caddy doesn’t remember any failures associated with either of the 2 backends (on restart, or after fail_duration), and a POST request with body greater than 4k in size comes in, it will try the first backend. If that backend is down, it will fail, and won’t retry. This happens regularly as Caddy forgets that a backend is down.

I’m still investigating and trying to confirm my understanding, and checking to see if I can find any fixes besides buffering the entire request body.
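
A rough way to reproduce this is to send a POST whose body is well over 4k while the first backend is down. The sketch below is an assumption about how such a test could look, not the exact commands used here: make_body is a hypothetical helper, and the URL in the usage comment is a placeholder based on the site name in the config above.

```shell
#!/bin/sh
# Print a JSON document whose "pad" field holds the given number of
# filler bytes, so the total body is slightly larger than that number.
make_body() {
  size=$1
  pad=$(head -c "$size" /dev/zero | tr '\0' 'a')
  printf '{"pad":"%s"}' "$pad"
}

# Usage (placeholder URL; run while the first backend is stopped):
#   make_body 8192 > /tmp/body.json
#   curl -sS -o /dev/null -w '%{http_code}\n' \
#     -H 'Content-Type: application/json' \
#     --data-binary @/tmp/body.json https://testsite.daaku.org/
```

Comparing the status code for this body against a much smaller one (for example make_body 100) should show whether the body size is what matters.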

The request should be safely retried even if it’s a POST as long as the failures are connection failures rather than other kinds of errors.

Can you show your Caddy logs from a failing request when a backend is down? It should tell us what kind of error you get.

The issue is with other kinds of errors: if they happen after Caddy has started writing the body upstream, it’s not possible to send the POST body again afterwards, since Caddy doesn’t buffer the request body (it just streams it).

The logic for retries is in here

Here’s the log line with the error:

Apr 20 07:35:16 daaku.org caddy[1741]: {"level":"error","ts":1713598516.1596339,"logger":"http.log.error","msg":"readfrom tcp 100.85.176.70:39494->100.72.108.128:34524: body closed by handler","request":{"remote_ip":"94.205.44.30","remote_port":"38842","client_ip":"94.205.44.30","proto":"HTTP/2.0","method":"POST","host":"mysite.daaku.org","uri":"/colaz","headers":{"Content-Length":["8278"],"User-Agent":["curl/8.7.1"],"Accept":["*/*"],"Content-Type":["application/json"],"X-Hub-Signature":["sha1=cc10ccd700773fcfc359441cba36942993b10815"]},"tls":{"resumed":false,"version":772,"cipher_suite":4865,"proto":"h2","server_name":"mysite.daaku.org"}},"duration":0.01549551,"status":502,"err_id":"797j8rumx","err_trace":"reverseproxy.statusError (reverseproxy.go:1267)"}

The reverse proxy destination is a tailscale IP address. There is no server listening on this port, but maybe tailscale alters the TCP error somehow and Caddy doesn’t recognize it as a connection failure?

Oh, that might explain it then. You’re not making a direct TCP connection to the actual server; you’re using a tunnel which is available but fails to forward the connection, so it doesn’t act like an actual TCP connection failure.

We could probably special-case a few of these classes of errors but it’s kinda tricky because all we get from the Go stdlib is error strings (not named error types) for this stuff. See golang/go issue #18272 (“net/http: Too hard to tell if a RoundTrip error came from reading from the Body or from talking to the target server”) and golang/go issue #13667 (“net/http: Transport.RoundTrip errors could be more informative”), which are complaints about this (still unresolved).

You are correct - tailscale has a different error (localhost returns the expected connection refused error). I will look for a fix for my specific situation.

Thanks for your guidance!

Matt just made a commit on master which might resolve the problem with your case. Can you try building from this commit?

All you need to do is enable request_buffers to buffer the body before sending it upstream.

You might also need to enable lb_retry_match in your proxy config to allow POST to be retried, but I have concerns that this will cause retries even for non-connection errors, when the upstream might have already used the POST body to do some kind of write operation.
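
As a rough illustration of the two options mentioned above, the helper below writes a candidate Caddyfile that adds request_buffers and lb_retry_match to the earlier config, so it can be validated before being put in place. The 16KB buffer size and the POST-only matcher are assumptions picked for this example, write_candidate_config is a hypothetical helper name, and whether retries actually behave as hoped depends on running a Caddy build that includes the commit mentioned above.

```shell
#!/bin/sh
# Write a candidate Caddyfile to the path given as the first argument.
# Assumptions: 16KB is large enough to hold the request bodies in
# question, and POST is the only extra method that should be retried.
write_candidate_config() {
  cat > "$1" <<'EOF'
testsite.daaku.org {
	reverse_proxy 127.0.0.1:9000-9001 {
		fail_duration 5s
		lb_policy first
		lb_try_duration 8s
		request_buffers 16KB
		lb_retry_match {
			method POST
		}
	}
}
EOF
}

# Usage:
#   write_candidate_config /tmp/Caddyfile.candidate
#   caddy validate --config /tmp/Caddyfile.candidate --adapter caddyfile
```

The concern above still applies: allowing POST retries means a non-idempotent request could reach an upstream twice if the first attempt failed after the body was already processed.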

I’ve changed my setup to just accept the minimal downtime during deploys and kept the service on just one port. Buffering and POST retries add unnecessary complexity for my use case.

It’s unfortunate tailscale is somehow mucking with the connection refused error. That feels like the right thing to fix here.

Thanks for your help!