Long-lived WSS connections

1. Output of caddy version:

Not yet using Caddy but will run the latest

2. How I run Caddy:

Will run on Debian machines, potentially in a cluster, using Redis to store certificates

a. System environment:

Will likely run Caddy under systemd, on bare metal running Debian (unsure which version, sorry!)

b. Command:

Unsure yet

c. Service/unit/compose file:

Will likely be the default service file (not yet confirmed)

d. My complete Caddy config:

The config will likely look something like this (yet to implement):

subdomain1.example.com {
	handle /wss* {
		reverse_proxy app1:4000
	}
	reverse_proxy app2:4001
}

subdomain2.example.com {
	handle /wss* {
		reverse_proxy app1:4002
	}
	reverse_proxy app2:4003
}

<repeated a large number of times>

3. The problem I’m having:

Hi all! First off, I’m not having a problem with Caddy; I’m looking for some advice on whether it will have the same problem Nginx does. I’m also not trying to start an Nginx vs. Caddy argument, just trying to understand whether Caddy would run into the same issue.

A friend of mine is running a medium-sized operation that has a few Nginx “gateway” servers proxying traffic to the application servers sitting behind them. Excuse the spaghetti, but hopefully this diagram makes some sense of the setup.

This setup works fine but runs into an issue when reloading Nginx. The application running on this setup uses long-lived websocket connections to give live updates to clients. When Nginx is reloaded to apply config updates, these connections don’t die, so the old workers end up staying around. This leads to higher memory usage on every reload (~500 MB each time, give or take 50 MB or so), and eventually a full restart is required to get memory usage back under control, which sucks because it essentially means you have an outage.

Nginx’s solution is to set a worker_shutdown_timeout so that the older workers do eventually shut down even if there are still websocket connections active, but this obviously breaks those connections.

Reloads need to happen somewhat frequently as new servers or application instances are added/removed.

I understand why this needs to happen with Nginx, so what I’m trying to understand is whether Caddy reloads would run into the same limitation. The setup would be running on Debian, and we could use the Admin API to push updates, or more than likely just let Caddy pick up Caddyfile changes through the default systemd reload handler.

Does Caddy need to fork new child processes to handle new configurations, or does it hot-load the new config into the running process somehow? Hopefully what I’m asking makes sense, but please let me know if not!

I also want to state that I completely understand all of the other amazing features Caddy can bring to the table to help this setup. At the moment this is the most pressing issue, but Caddy’s handling of certificates will also be a major focus in the future, if that’s relevant. A Caddy migration would be ideal and likely; I just want to make sure I understand whether we need to add some special handling for this particular issue, like moving websocket connections to a dedicated domain such as wss.example.com or something.
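For what it’s worth, a purely hypothetical sketch of that dedicated-domain idea, reusing the placeholder names from the config above (wss.example.com and app1:4000 are just examples, nothing is decided):

wss.example.com {
	# Terminate all websocket traffic on its own hostname instead of
	# path-matching /wss* inside each subdomain's site block
	reverse_proxy app1:4000
}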

4. Error messages and/or full log output:

N/A

5. What I already tried:

Nothing yet, just seeking information!

6. Links to relevant resources:


No, Caddy will automatically close all websocket connections on config reload. Caddy loads the new configuration, then waits for all in-flight requests to finish (unless grace_period is set); after that, the new config is fully in effect. Caddy only spawns new goroutines, which are very memory efficient.
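For reference, grace_period is available as a Caddyfile global option. A minimal sketch, assuming a Caddyfile-based setup; the 10s value is purely illustrative:

{
	# How long Caddy waits for existing connections to finish
	# during a config reload before closing them forcibly
	grace_period 10s
}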


Thanks for the reply! Just to make sure I’m understanding you correctly:

  1. If grace_period is set, Caddy will wait up to the configured duration for connections to close, then forcibly close them
  2. If grace_period isn’t set, it defaults to 0, which means connections can stay alive forever and Caddy will wait for them to close?

I’m slightly confused. Would that mean that, in this instance, Caddy could potentially never get the new config?

Welcome back James!

Yes. In Caddy 2.6, we made some relevant changes/improvements:

  • Proxied WebSockets are closed as gracefully as possible during a config change.
  • Config reloads no longer block waiting for the grace period to finish.

From the 2.6 release notes:

Speaking of grace periods, config changes no longer block while waiting on servers’ grace periods. This means faster, more responsive config reloads; just beware that, depending on the length of your grace period, your reload command or config API request may return before the old servers have completely finished shutting down.

If connections are left open (for example, large file downloads that are progressing slowly), this can theoretically run up your memory bill if you have, say, 100 reloads / minute and your grace period is, I dunno, longer than that? Like 5 minutes… as they will be overlapping for some time.

This new behavior was requested by one of our sponsors who is deploying Caddy at considerable scale. They have lots of WebSockets going on, and haven’t reported any memory problems to me.

So the grace period is probably similar to nginx’s worker_shutdown_timeout – but as @WeidiDeng said in his excellent answer, Caddy does its best to gracefully close websocket connections when the server is unloaded, so it shouldn’t be an issue in most cases, I’d imagine.

Yes, but remember that Caddy actively closes proxied connections that are recognized as WebSockets. The grace period is still important if you have clients with lots of packet loss, or slow, long-lived downloads, for example.

Caddy will always start using the new config; the new server (configuration) is already listening and accepting connections before the old server (configuration) stops accepting new connections and closes remaining ones.


Thanks!

Interesting! I don’t know for certain how Nginx does it, but I wonder whether it forcefully closes them so the browser doesn’t retry/reconnect, versus closing them gracefully like Caddy does? I’ll need to test it and see what happens. Reading the Nginx docs, it isn’t clear whether connections are closed gracefully or forcefully when worker_shutdown_timeout is hit. Either way, it’s likely way better than a full restart of Nginx. :smile:

If connections are left open (for example, large file downloads that are progressing slowly), this can theoretically run up your memory bill if you have, say, 100 reloads / minute and your grace period is, I dunno, longer than that? Like 5 minutes… as they will be overlapping for some time.

I don’t know for certain, but I think the reloads are more like several times a day, somewhere in the region of maybe 5-10 times a day. Not massive, but the size of the memory leak causes it to become a problem. I think there are somewhere in the region of ~1,000 different SSL certificates + server blocks configured, so it’s just a slow process.

Yes, but remember that Caddy actively closes proxied connections that are recognized as WebSockets. The grace period is still important if you have clients with lots of packet loss, or slow, long-lived downloads, for example.

Understood! I don’t think non-websocket requests are a problem in this situation but handy to know.

Thanks everyone for your answers! I’ll feed back what I’ve found out and go from there. :smile:


Sounds good, keep us posted. What is this “medium sized operation”? :smiley:

I don’t know how Nginx does it either, but what Caddy does is send a websocket close frame, which is basically a small bit of data telling the client “hey, this connection is being closed”, and then drops the connection. It sends that message to both sides (both to the downstream client, and the upstream websocket/app server).

Browser JS code should typically be set up to attempt reconnecting if the connection is closed, unless the intent is actually to stop the websocket connection (e.g. navigating to a view that doesn’t need the connection, I dunno, depends on the app).

There was a bug in v2.6.2 relating to close frames being sent to the upstream server, which I’ve fixed; it should be released in v2.6.3 shortly. reverseproxy: Mask the WS close message when we're the client by francislavoie · Pull Request #5199 · caddyserver/caddy · GitHub

This topic was automatically closed after 30 days. New replies are no longer allowed.