Blue / Green Deployment without downtime

daaku · April 17, 2024, 5:34am

1. The problem I’m having:

On a single server, I have Caddy running which serves various sites. This used to be a haproxy setup, which has now been replaced by Caddy. The sites are deployed blue / green style, with a pair of ports assigned to blue / green respectively. The deploy process for the site is roughly:

Check which port is currently active and choose the other port.
Deploy the new version of the site on the other port.
Health check the new version of the site on the other port.
Shut down the old version of the site on the current port.
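
The deploy steps above can be sketched roughly as follows. This is only an illustration, not the actual deploy script: the helper names (`listening`, `idlePort`, `healthy`) are hypothetical, the ports are the 9000/9001 pair used in this thread, and starting or stopping the site itself is left as a comment because it is site-specific.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// Port pair from the thread: 9000 (green) and 9001 (blue).
const (
	greenPort = 9000
	bluePort  = 9001
)

// listening reports whether something accepts TCP connections on the port.
func listening(port int) bool {
	conn, err := net.DialTimeout("tcp", fmt.Sprintf("127.0.0.1:%d", port), time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// idlePort returns the port of the pair that is not currently active.
func idlePort(active int) int {
	if active == greenPort {
		return bluePort
	}
	return greenPort
}

// healthy performs a single health check against /healthz/ on baseURL.
func healthy(baseURL string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(baseURL + "/healthz/")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Assume green is active unless blue is the one listening.
	active := greenPort
	if listening(bluePort) {
		active = bluePort
	}
	next := idlePort(active)
	fmt.Printf("active=%d, deploy new version on %d\n", active, next)

	// Start the new version on `next` here (site-specific), then check it.
	if healthy(fmt.Sprintf("http://127.0.0.1:%d", next)) {
		fmt.Println("new version healthy; shut down the old one on", active)
	} else {
		fmt.Println("new version not healthy; keep the old one running")
	}
}
```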

With haproxy, this config served me well:

backend myweb
  stick-table type ip size 1m
  stick on dst
  mode http
  server green 127.0.0.1:9000 check
  server blue  127.0.0.1:9001 check
  timeout tunnel 10h

With Caddy, I currently have this config which isn’t identical, but mostly works:

reverse_proxy 127.0.0.1:9000-9001 {
	health_uri /healthz/
	lb_try_duration 5s
}

I’m facing 2 issues which I’m struggling to resolve.

The first request after a deploy fails. Caddy tries for 5s, then fails the request. Subsequent requests work.
Since this is using the active health check, Caddy continuously checks the failed server. I would like this not to happen, and would like Caddy to check only when the current server fails. The failed checks also spam the logs.

2. Error messages and/or full log output:

{"level":"info","ts":1713331751.4350946,"logger":"http.handlers.reverse_proxy.health_checker.active","msg":"HTTP request failed","host":"127.0.0.1:9000","error":"Get \"http://127.0.0.1:9000/healthz/\": dial tcp 127.0.0.1:9000: connect: connection refused"}

3. Caddy version:

v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=

4. How I installed and ran Caddy:

xcaddy build v2.7.6 \
  --with github.com/caddy-dns/cloudflare \
  --with github.com/greenpau/caddy-security

a. System environment:

Arch Linux, caddy running as a systemd service.

b. Command:

/usr/bin/caddy run --config /etc/caddy/Caddyfile

c. My complete Caddy config:

testsite.daaku.org {
	reverse_proxy 127.0.0.1:9000-9001 {
		health_uri /healthz/
		lb_try_duration 5s
	}
}

daaku · April 17, 2024, 6:17am

In typical fashion, I think I have found a working configuration that solves both my problems soon after asking for help. The first request after a deploy works, and health checks are now passive instead of active:

testsite.daaku.org {
	reverse_proxy 127.0.0.1:9000-9001 {
		fail_duration 30s
		lb_policy first
		lb_retries 2
	}
}

Need to verify if 2 deploys within 30s will be an issue, but that seems like an edge case for me.

francislavoie · April 17, 2024, 7:29am

I think you could get around the 2-deploys issue by setting lb_try_duration to slightly longer than fail_duration, so it gets a chance to retry afterwards. Maybe 5s fail duration with 7s or 8s try duration? Probably don’t need fail duration to be as long as 30s.

daaku · April 17, 2024, 7:56am

So to clarify, you’re suggesting something like this:

testsite.daaku.org {
	reverse_proxy 127.0.0.1:9000-9001 {
		fail_duration 5s
		lb_policy first
		lb_try_duration 8s
	}
}

And if I understand correctly, this would mean when a request fails due to the deploy shutting down the old instance, we’ll keep retrying for 5s? Additionally an instance will only be marked down for 5s after it fails a request?

francislavoie · April 17, 2024, 3:53pm

It’ll keep retrying for 8s, which is longer than the 5s of fail_duration, so it’s likely to get a second try on the same upstream that went down just before the request came in. The first try could have marked the upstream as down; if both upstreams are then down for a while, Caddy would be able to “forget” that the first one failed before it gives up retrying. That’s the theory anyway.

daaku · April 17, 2024, 5:41pm

Ah, makes sense about the forgetting the downed instance. One last question, do I still need lb_retries with this setup?

francislavoie · April 18, 2024, 4:53pm

You probably don’t need lb_retries, no. From the docs:

If lb_try_duration is also configured, then retries may stop early if the duration is reached. In other words, the retry duration takes precedence over the retry count.

daaku · April 19, 2024, 7:00am

I think I found another wrinkle, which essentially makes this setup non-functional. I was testing with GET requests, so I did not notice this issue. It is not reproducible with small POST requests either. I believe requests larger than 4k (the default read buffer size) are what trigger it.

Essentially, when Caddy doesn’t remember any failures associated with either of the 2 backends (on restart, or after fail_duration), and a POST request with body greater than 4k in size comes in, it will try the first backend. If that backend is down, it will fail, and won’t retry. This happens regularly as Caddy forgets that a backend is down.

I’m still investigating and trying to confirm my understanding, and checking to see if I can find any fixes besides buffering the entire request body.

francislavoie · April 19, 2024, 11:48pm

The request should be safely retried even if it’s a POST as long as the failures are connection failures rather than other kinds of errors.

Can you show your Caddy logs from a failing request when a backend is down? It should tell us what kind of error you get.

The issue is with other kinds of errors: if they happen after Caddy has started writing the body upstream, the POST body can’t be sent again, because Caddy doesn’t buffer the request body (it just streams it).

The logic for retries is in here:

https://github.com/caddyserver/caddy/blob/d00824f4a648238cadacd1c999cedcba5f40323e/modules/caddyhttp/reverseproxy/reverseproxy.go#L1038

          	// if we've reached the retry limit, break
          	if lb.Retries > 0 && retries >= lb.Retries {
          		return false
          	}
          
          	// if the error occurred while dialing (i.e. a connection
          	// could not even be established to the upstream), then it
          	// should be safe to retry, since without a connection, no
          	// HTTP request can be transmitted; but if the error is not
          	// specifically a dialer error, we need to be careful
          	if proxyErr != nil {
          		_, isDialError := proxyErr.(DialError)
          		herr, isHandlerError := proxyErr.(caddyhttp.HandlerError)
          
          		// if the error occurred after a connection was established,
          		// we have to assume the upstream received the request, and
          		// retries need to be carefully decided, because some requests
          		// are not idempotent
          		if !isDialError && !(isHandlerError && errors.Is(herr, errNoUpstream)) {
          			if lb.RetryMatch == nil && req.Method != "GET" {
          				// by default, don't retry requests if they aren't GET

daaku · April 20, 2024, 7:40am

Here’s the log line with the error:

Apr 20 07:35:16 daaku.org caddy[1741]: {"level":"error","ts":1713598516.1596339,"logger":"http.log.error","msg":"readfrom tcp 100.85.176.70:39494->100.72.108.128:34524: body closed by handler","request":{"remote_ip":"94.205.44.30","remote_port":"38842","client_ip":"94.205.44.30","proto":"HTTP/2.0","method":"POST","host":"mysite.daaku.org","uri":"/colaz","headers":{"Content-Length":["8278"],"User-Agent":["curl/8.7.1"],"Accept":["*/*"],"Content-Type":["application/json"],"X-Hub-Signature":["sha1=cc10ccd700773fcfc359441cba36942993b10815"]},"tls":{"resumed":false,"version":772,"cipher_suite":4865,"proto":"h2","server_name":"mysite.daaku.org"}},"duration":0.01549551,"status":502,"err_id":"797j8rumx","err_trace":"reverseproxy.statusError (reverseproxy.go:1267)"}

The reverse proxy destination is a tailscale IP address. There is no server listening on this port, but maybe tailscale alters the TCP error somehow and Caddy doesn’t recognize it as a connection failure?

francislavoie · April 20, 2024, 6:21pm

Oh, that might explain it then. You’re not making a direct TCP connection to the actual server. You’re using a tunnel which is available but fails to forward the connection, so it doesn’t behave like an actual TCP connection failure.

We could probably special-case a few of these classes of errors but it’s kinda tricky because all we get from the Go stdlib is error strings (not named error types) for this stuff. See the Go issues “net/http: Too hard to tell if a RoundTrip error came from reading from the Body or from talking to the target server” (golang/go #18272) and “net/http: Transport.RoundTrip errors could be more informative” (golang/go #13667), which are complaints about this (still unresolved).

daaku · April 21, 2024, 11:15am

You are correct - tailscale has a different error (localhost returns the expected connection refused error). I will look for a fix for my specific situation.

Thanks for your guidance!

francislavoie · April 22, 2024, 7:16pm

Matt just made a commit on master which might resolve the problem with your case. Can you try building from this commit?

All you need to do is enable request_buffers to buffer the body before sending it upstream.

You might also need to enable lb_retry_match in your proxy config to allow POST to be retried, but I have concerns that this will cause retries even for non-connection errors, when the upstream might have already used the POST body to do some kind of write operation.

daaku · April 26, 2024, 4:25am

I’ve changed my setup to just accept the minimal downtime during deploys and kept the service on just one port. Buffering/POST retries adds unnecessary complexity for my use case.

It’s unfortunate tailscale is somehow mucking with the connection refused error. That feels like the right thing to fix here.

Thanks for your help!
