MQTT/TCP keepalive when proxying WebSocket

1. Caddy version (caddy version):

v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

2. How I run Caddy:

docker-compose

c. Service/unit/compose file:

caddy:
  environment:
    - CADDY_TLS
    - DOMAIN_NAME
    - TZ=Europe/London
  image: caddy
  ports:
    - "443:443"
    - "80:80"
  restart: always
  volumes:
    - ./deployment/Caddyfile:/etc/caddy/Caddyfile
    - caddy_config:/config
    - caddy_data:/data
    - django_static:/app/static
    - django_media:/app/media

rabbitmq:
  container_name: rabbitmq
  environment:
    - RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS=-rabbitmq_management path_prefix "/rabbitmq"
    - DEFAULT_USER=admin
    - DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
  hostname: rabbitmq  # must be set for data to be persisted - https://github.com/docker-library/rabbitmq/issues/106
  image: rabbitmq:3-management-alpine
  ports:
    - "1883:1883"
  restart: always
  volumes:
    - rabbitmq_data:/var/lib/rabbitmq
    - ./deployment/rabbitmq_plugins:/etc/rabbitmq/enabled_plugins:ro

d. My complete Caddyfile or JSON config:

{$DOMAIN_NAME} {
	# serve some files directly from caddy rather than through django/gunicorn for performance
	root * /app # needed for file_server
	file_server /static/* # assets e.g. js, css
	file_server /media/* # uploaded files e.g. OTA updates
	rewrite /favicon.ico /static/images/favicon.ico

	# various reverse proxies which strip the TLS part and pass through http
	reverse_proxy /ws rabbitmq:15675 # rabbitmq websocket
	reverse_proxy /rabbitmq/* rabbitmq:15672 # rabbitmq admin
	reverse_proxy /grafana/* grafana:3000 # grafana for internal use
	reverse_proxy /portal/* portal:3000 # custom grafana instance for landlord portal

	# for all other web requests and not /static or /media, proxy to Django
	@not_served {
		not path /static/* /media/*
	}
	reverse_proxy @not_served web:8000

	tls {$CADDY_TLS} # generate ssl cert automatically
	encode gzip # encode gzip for performance
	log {
		level WARN
	}
}

# allow both http and https for firmware as needed by some devices
http://{$DOMAIN_NAME} {
	handle /media/firmware* {
		root * /app
		file_server
	}

	# fallback to redirect
	handle {
		redir https://{host}{uri} 308
	}
}

3. The problem I’m having:

I am connecting devices to a rabbitmq container using MQTT-over-WSS, but using Caddy to terminate the TLS connection. So the device connects to Caddy on port 443 using wss://, and between Caddy and rabbitmq it’s plain ws://.
I did it this way because I thought it was a good way to avoid having to generate my own CA certs etc. in rabbitmq when I can just use the ones in Caddy, and also because the connection is less likely to be blocked by firewalls on port 443. But no one else seems to be doing it this way, so it might be a terrible idea.

Everything works fine, but now that I’m connecting devices over 4G with a limited data allowance, I have noticed that an MQTT keepalive is being sent every 15s, which unnecessarily uses data. If I increase this value to 120s or above (so a packet is sent every 60s, since the client pings at half the keepalive interval) then the MQTT connection gets closed (see log below). If I use a value of 115s then the connection stays open, but TCP keepalive packets are sent every 15s instead, as can be seen in the packet capture below:

Time    Source      Destination  Protocol  Length  Info
33.8    172.18.0.6  <client>     TCP       54      443 → 30441 [ACK] Seq=2 Ack=114 Win=63989 Len=0
33.8    172.18.0.6  <client>     TLSv1.2   89      Application Data
34.0    <client>    172.18.0.6   TCP       54      30441 → 443 [ACK] Seq=114 Ack=37 Win=4710 Len=0
49.1    172.18.0.6  <client>     TCP       54      [TCP Keep-Alive] 443 → 30441 [ACK] Seq=36 Ack=114 Win=63989 Len=0
49.6    <client>    172.18.0.6   TCP       54      [TCP Keep-Alive ACK] 30441 → 443 [ACK] Seq=114 Ack=37 Win=4710 Len=0
64.6    172.18.0.6  <client>     TCP       54      [TCP Keep-Alive] 443 → 30441 [ACK] Seq=36 Ack=114 Win=63989 Len=0
65.7    <client>    172.18.0.6   TCP       54      [TCP Keep-Alive ACK] 30441 → 443 [ACK] Seq=114 Ack=37 Win=4710 Len=0
80.7    172.18.0.6  <client>     TCP       54      [TCP Keep-Alive] 443 → 30441 [ACK] Seq=36 Ack=114 Win=63989 Len=0
81.7    <client>    172.18.0.6   TCP       54      [TCP Keep-Alive ACK] 30441 → 443 [ACK] Seq=114 Ack=37 Win=4710 Len=0
92.2    <client>    172.18.0.6   TLSv1.2   89      Application Data
92.2    172.18.0.6  <client>     TCP       54      443 → 30441 [ACK] Seq=37 Ack=149 Win=63989 Len=0
92.5    <client>    172.18.0.6   TLSv1.2   85      Application Data
92.5    172.18.0.6  <client>     TCP       54      443 → 30441 [ACK] Seq=37 Ack=180 Win=63989 Len=0
92.5    172.18.0.6  <client>     TLSv1.2   87      Application Data
92.7    <client>    172.18.0.6   TCP       54      30441 → 443 [ACK] Seq=180 Ack=70 Win=4677 Len=0
108.0   172.18.0.6  <client>     TCP       54      [TCP Keep-Alive] 443 → 30441 [ACK] Seq=69 Ack=180 Win=63989 Len=0
108.5   <client>    172.18.0.6   TCP       54      [TCP Keep-Alive ACK] 30441 → 443 [ACK] Seq=180 Ack=70 Win=4677 Len=0
123.5   172.18.0.6  <client>     TCP       54      [TCP Keep-Alive] 443 → 30441 [ACK] Seq=69 Ack=180 Win=63989 Len=0
124.5   <client>    172.18.0.6   TCP       54      [TCP Keep-Alive ACK] 30441 → 443 [ACK] Seq=180 Ack=70 Win=4677 Len=0
139.5   172.18.0.6  <client>     TCP       54      [TCP Keep-Alive] 443 → 30441 [ACK] Seq=69 Ack=180 Win=63989 Len=0
140.6   <client>    172.18.0.6   TCP       54      [TCP Keep-Alive ACK] 30441 → 443 [ACK] Seq=180 Ack=70 Win=4677 Len=0
149.8   <client>    172.18.0.6   TLSv1.2   89      Application Data
149.8   172.18.0.6  <client>     TCP       54      443 → 30441 [ACK] Seq=70 Ack=215 Win=63989 Len=0
150.1   <client>    172.18.0.6   TLSv1.2   85      Application Data
150.1   172.18.0.6  <client>     TCP       54      443 → 30441 [ACK] Seq=70 Ack=246 Win=63989 Len=0
150.1   172.18.0.6  <client>     TLSv1.2   87      Application Data
150.3   <client>    172.18.0.6   TCP       54      30441 → 443 [ACK] Seq=246 Ack=103 Win=4644 Len=0

You can see that MQTT keepalive packets are sent at 33s, 92s, 150s (roughly 60s apart) which is right but then TCP keepalive packets are being sent in between.

If I connect a device straight to rabbitmq (not through Caddy) using plain MQTT on port 1883, then I have no problem using a longer MQTT keepalive, which makes me believe it’s the Caddy proxy causing the problem and not something else. I also found that Go sets a default TCP keepalive of 15s, which makes me think Caddy is generating them.
So I am wondering if there’s a setting somewhere which controls a) the TCP keepalives and b) when the proxied connection is closed due to inactivity. I am assuming here that Caddy is closing the proxied connection for some reason after a minute of inactivity, but I could be wrong about this.

I am coming round to the idea that perhaps I should just connect directly to rabbitmq using MQTTS, which would probably result in fewer problems, but I thought I’d check first whether there’s a way of making it work through Caddy.

4. Error messages and/or full log output:

This is the log output from the client when the keepalive is set to a high value, which results in MQTT disconnecting frequently:

I (16:27:34.102) MQTT: MQTT Connected
I (16:28:43.165) TRANSPORT_WS: Got CLOSE frame with status code=1000
W (16:28:43.173) TRANSPORT_WS: esp_transport_ws_poll_connection_closed: unexpected data readable on socket=57
W (16:28:43.174) TRANSPORT_WS: Connection terminated while waiting for clean TCP close
E (16:28:43.183) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=119
E (16:28:43.192) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1
W (16:28:43.204) MQTT: MQTT disconnected
I (16:29:02.113) MQTT: MQTT Connected
I (16:30:03.170) TRANSPORT_WS: Got CLOSE frame with status code=1000
W (16:30:03.177) TRANSPORT_WS: esp_transport_ws_poll_connection_closed: unexpected data readable on socket=57
W (16:30:03.178) TRANSPORT_WS: Connection terminated while waiting for clean TCP close
E (16:30:03.188) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=119
E (16:30:03.197) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1
W (16:30:03.210) MQTT: MQTT disconnected

MQTT stays connected for around a minute, then disconnects.

5. What I already tried:

I tried the following, but it made no difference, probably because websockets aren’t HTTP:

reverse_proxy /ws rabbitmq:15675 {
    transport http {
        keepalive off
    }
}

6. Links to relevant resources:


No, this is a great way to do it. It simplifies a bunch of stuff client-side IMO (for general WS, in my experience; I can’t comment on MQTT because I don’t use it).

There are some keepalive options you can play with in the reverse_proxy directive's HTTP transport:

I’ve never needed to play around with changing keepalives, so I don’t have any specific recommendations there. But this is for the connection between Caddy and your upstream.
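In a Caddyfile that might look roughly like this (the duration and connection count are placeholder values, not recommendations):

```
reverse_proxy /ws rabbitmq:15675 {
	transport http {
		keepalive 2m
		keepalive_idle_conns 4
	}
}
```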

This thread mentions that keepalive control was added to ListenConfig, which seems like an alternate way to set up a listener. Caddy doesn’t use this, but maybe it could, to make keepalives configurable for the connection between your client and Caddy.

What’s awkward though is that this server keepalive would be for all connections on the server listening on port 443, so if you were to turn off keepalives, it would also turn them off for regular HTTP traffic (which is maybe not ideal? I dunno).

FYI @matt you might understand this more.

I did try to set keepalive off in the reverse_proxy section, but it didn’t have any effect. Apparently with websockets only the initial handshake is HTTP; after that it’s a raw TCP stream, which is probably why it had no effect.

I decided to set up TLS properly in rabbitmq and terminate it there rather than in Caddy, and it actually wasn’t that difficult. Connecting straight to the broker fixes the TCP keepalive problem and the disconnections, and allows much higher keepalive values like 10 minutes.
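For anyone following along, the rabbitmq.conf ends up looking roughly like this sketch (certificate paths are placeholders for your own files; 8883 is the conventional MQTTS port):

```
# enable TLS for AMQP and for the MQTT plugin
listeners.ssl.default            = 5671
mqtt.listeners.ssl.default       = 8883

# server certificate chain
ssl_options.cacertfile           = /certs/ca_certificate.pem
ssl_options.certfile             = /certs/server_certificate.pem
ssl_options.keyfile              = /certs/server_key.pem

# require devices to present a valid client certificate
ssl_options.verify               = verify_peer
ssl_options.fail_if_no_peer_cert = true
```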

Having read the rabbitmq TLS documentation, I now think that using Caddy for proxying MQTT isn’t a good idea. It does give you a secure connection and lets the client verify the server, but it doesn’t let the server verify the client. Most people will be using Let's Encrypt certificates for a public website, where you don’t need to verify the client, but with MQTT you will want to, and additionally you probably want each device to have a unique client certificate so it can be revoked if necessary. It doesn’t seem like you can generate Let's Encrypt client certificates, because then anyone with a Let's Encrypt server certificate could also generate one.
Proxying through Caddy also makes problems harder to diagnose, since the connection now goes through two things, and there is also the whole issue of keepalives.

I think most people doing IoT will actually be using a hosted service like AWS/GCP, which takes away all this complexity and deals with provisioning certificates etc.

That’s right. HTTP is used to set up the initial handshake/connection (it’s convenient), then it gets switched to a duplex TCP pipe.

You can actually configure client certificate verification in Caddy. See the client_auth option in the tls directive:
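Something along these lines (the trusted CA path is a placeholder for wherever your device CA lives):

```
{$DOMAIN_NAME} {
	tls {
		client_auth {
			mode require_and_verify
			trusted_ca_cert_file /path/to/device_ca.pem
		}
	}
}
```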

But point taken, probably easier to avoid proxying in this case :man_shrugging: