Caddy intermittent i/o timeout

1. The problem I’m having:

I’m having issues with Caddy and/or Docker and I need help figuring out which it is. I first noticed the problem while inspecting traffic from a hosted application (also in Docker): some requests return a 502 error, but never the same requests, and only sometimes. Caddy reports this in its logs as an i/o timeout. By exec’ing into the Caddy container, I attempted to ping the offending application container and got the following:

[Screenshot: ping output from inside the Caddy container, showing intermittent packet loss to the application container]

Seemingly completely random packet loss. When it happens, it affects all containers and all pinging/communication between them, and I cannot for the life of me figure out why; it’s almost as if Docker’s internal DNS starts to die over time. Creating a new Docker network and moving all containers to it immediately resolves the issue, but only for a finite amount of time: it could be days, it could be months.
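For clarity, the workaround is nothing fancier than standing up a fresh bridge network and re-pointing everything at it; roughly (the network name here is illustrative, and in practice I re-deploy the compose stacks against the new external network):

# create a replacement bridge network
docker network create caddy_network_2
# then either point each compose file at the new external network and redeploy,
# or attach/detach containers by hand:
docker network connect caddy_network_2 caddy
docker network disconnect caddy_network caddy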

While a Cloudflare tunnel is used for external access, when I’m on my home network (which is where I’m experiencing this issue) all traffic to my domains is routed directly to my NAS machine (via a dnsmasq line in my Pi-hole, which serves as my network’s DNS server). I know this is working by doing:

# dig audiobookshelf.asandhu.ca

; <<>> DiG 9.10.6 <<>> audiobookshelf.asandhu.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30145
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;audiobookshelf.asandhu.ca.	IN	A

;; ANSWER SECTION:
audiobookshelf.asandhu.ca. 0	IN	A	192.168.1.29

;; Query time: 14 msec
;; SERVER: 192.168.1.28#53(192.168.1.28)
;; WHEN: Wed Nov 20 12:47:47 MST 2024
;; MSG SIZE  rcvd: 70
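For reference, the override itself is a single dnsmasq line in Pi-hole’s custom config, along these lines (the file path is illustrative; the address is the NAS, as in the answer above):

# /etc/dnsmasq.d/99-local-override.conf (example path)
# answer every *.asandhu.ca query with the NAS address instead of the public Cloudflare records
address=/asandhu.ca/192.168.1.29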

So in essence, I’m bypassing Cloudflare completely on my local network, and all traffic from local clients should go directly to the machine. Another interesting fact: when I access a container by local IP rather than by domain, I get no issues at all; every request is a 200, and noticeably faster. That is the only part that makes me think Caddy may be involved; otherwise I’d be pretty confident it has to do with Docker. Thanks for any help!

2. Error messages and/or full log output:

ERR ts=1732131047.9271493 logger=http.log.error msg=dial tcp 172.19.0.14:80: i/o timeout request={"remote_ip":"192.168.1.36","remote_port":"52433","client_ip":"192.168.1.36","proto":"HTTP/2.0","method":"GET","host":"audiobookshelf.asandhu.ca","uri":"/_nuxt/a3e358e.js","headers":{"Sec-Fetch-Site":["same-origin"],"If-None-Match":["W/\"13776-1933f975e68\""],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0"],"Accept":["*/*"],"Accept-Language":["en-US,en;q=0.5"],"Sec-Fetch-Dest":["script"],"Sec-Fetch-Mode":["no-cors"],"Te":["trailers"],"Accept-Encoding":["gzip, deflate, br, zstd"],"Referer":["https://audiobookshelf.asandhu.ca/library/c2235e03-295d-4f19-b767-e5bdf32dc81c"],"Alt-Used":["audiobookshelf.asandhu.ca"],"Cookie":["REDACTED"],"If-Modified-Since":["Mon, 18 Nov 2024 14:05:05 GMT"]},"tls":{"resumed":false,"version":772,"cipher_suite":4865,"proto":"h2","server_name":"audiobookshelf.asandhu.ca"}} duration=3.003095031 status=502 err_id=91r5du7j3 err_trace=reverseproxy.statusError (reverseproxy.go:1269)

3. Caddy version:

v2.8.4

4. How I installed and ran Caddy:

a. System environment:

Debian 12 (Bookworm) running OpenMediaVault 7.4.13, with docker-ce 5:27.3.1-1~debian.12~bookworm

b. Command:

Ran via Portainer.

c. Service/unit/compose file:

version: "3.7"
services:
  caddy:
    container_name: caddy
    environment:
      - CLOUDFLARE_API_TOKEN="REDACTED"
      - PGID=100
      - PUID=1000
    image: ghcr.io/iarekylew00t/caddy-cloudflare:latest
    hostname: caddy
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
    restart: unless-stopped
    volumes:
      - /srv/dev-disk-by-uuid-3cc85caf-bda8-478d-8830-b467533075fe/docker-compose/caddy/Caddyfile:/etc/caddy/Caddyfile
      - /home/dockeruser/docker/config/caddy/config:/config
      - /home/dockeruser/docker/config/caddy/data:/data

networks:
  default:
    name: caddy_network
    external: true

d. My complete Caddy config:

(cloudflare) {
    encode gzip
    tls {
        dns cloudflare REDACTED
        resolvers 1.1.1.1
    }
}

{
    debug
}

*.asandhu.ca {
    import cloudflare

    @audiobookshelf host audiobookshelf.asandhu.ca
    reverse_proxy @audiobookshelf audiobookshelf:80

    @immich host immich.asandhu.ca
    reverse_proxy @immich immich_server:2283

    @jellyfin host jellyfin.asandhu.ca
    reverse_proxy @jellyfin jellyfin:8096
    
    @nas host nas.asandhu.ca
    reverse_proxy @nas 192.168.1.29:82

    @pihole host pihole.asandhu.ca
    reverse_proxy @pihole 192.168.1.28
    
    @portainer host portainer.asandhu.ca
    reverse_proxy @portainer 192.168.1.29:9000
}

e. Docker Network

root@nas:~# docker network inspect caddy_network
[
    {
        "Name": "caddy_network",
        "Id": "587d44cbc57255715c708a3f05017a26f37050d15bcd3eb21def8b821df0a211",
        "Created": "2024-11-17T17:35:00.780975271-07:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.19.0.0/16",
                    "Gateway": "172.19.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "2dbbdb9fcc56f7bac3aa6c07e2d343c77a36c33fb3ba8391341619cf9bc07de9": {
                "Name": "immich_machine_learning",
                "EndpointID": "a0d527725cdb851f1583ca5a65c0dbfe26da3e1a6b87d7c0709332315ebe8f34",
                "MacAddress": "02:42:ac:13:00:0f",
                "IPv4Address": "172.19.0.15/16",
                "IPv6Address": ""
            },
            "33ca0131340bdd22631f95a5807f360c34f7e78ff68aa39952b6cc058f52e23c": {
                "Name": "immich_postgres",
                "EndpointID": "9ad42de1b91f32ccc7c4fe697cb6d23afafdd678f9bfbbe888eeef46e2c0739e",
                "MacAddress": "02:42:ac:13:00:03",
                "IPv4Address": "172.19.0.3/16",
                "IPv6Address": ""
            },
            "5cfdd6da608d944e71573e106c8e60e87d6a5dd63b3a1875bd6fb2b534c3da68": {
                "Name": "immich_server",
                "EndpointID": "ca34707a1f1a953e086fc98ff320644bf7647ab65b0ea092f014b1c4ee83f0cb",
                "MacAddress": "02:42:ac:13:00:0d",
                "IPv4Address": "172.19.0.13/16",
                "IPv6Address": ""
            },
            "88f05f96d5168516e48bb4bc9a05f10958af82b80259889e4c5700ee5c441bc1": {
                "Name": "jellyfin",
                "EndpointID": "87accc57fd92e77ac45ff59ae9150e2fed3efddd0b6643da1229b1fae6a31e41",
                "MacAddress": "02:42:ac:13:00:0c",
                "IPv4Address": "172.19.0.12/16",
                "IPv6Address": ""
            },
            "977e732afe8b57070ea8f97770ed1e4ed5c82ab08af328996ff7bc45598d618b": {
                "Name": "audiobookshelf",
                "EndpointID": "64aab0e717bbb62ecfc76cf952330809e87db30f9a24dd65a17f520850cc9678",
                "MacAddress": "02:42:ac:13:00:02",
                "IPv4Address": "172.19.0.14/16",
                "IPv6Address": ""
            },
            "ab304140777cfcd9aa374b1745f657ad09e4113566cdb229f33b0fb60019f929": {
                "Name": "cloudflare-tunnel",
                "EndpointID": "302127c9e7fa42be3100869a4e13acc82bfc5663068d90c0462b88185efeced6",
                "MacAddress": "02:42:ac:13:00:02",
                "IPv4Address": "172.19.0.2/16",
                "IPv6Address": ""
            },
            "b77172af768c69a4aa670c9baa15b2224270723ff5a25890ff4bd61b3bbcf1d2": {
                "Name": "caddy",
                "EndpointID": "2ad1a8b38161ee4726d172e16902b72aa5c63e3cc730809c2631b2070c33c0ca",
                "MacAddress": "02:42:ac:13:00:0a",
                "IPv4Address": "172.19.0.10/16",
                "IPv6Address": ""
            },
            "eb804b5d337b86ac7f7698aea04ba6e5caf5c5e9c34f84f89dfdc52eec488f46": {
                "Name": "immich_redis",
                "EndpointID": "cf93b86921b5d3723ae426e5e55c354f81ff3de5a9df8dec4eaa1336c52035a2",
                "MacAddress": "02:42:ac:13:00:07",
                "IPv4Address": "172.19.0.7/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

Wait, you’re getting packet loss between your Caddy container and your other containers? And those containers are all running on the same machine? wat. That’s definitely not a Caddy issue then. I don’t have a clue what that could be. Might need to do a fresh install on that machine in case it’s some crazy OS-level issue causing networking trouble. Or it’s some weird Docker network config issue. Networking stuff like this is outside of my expertise.

That said, some config tips:

Global options should always go first, snippets and sites after.
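In this case that just means moving the global options block above the (cloudflare) snippet, so the top of the Caddyfile becomes:

{
    debug
}

(cloudflare) {
    # ...snippet contents unchanged...
}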

I recommend using handle blocks instead, so you can more easily have a fallback for unmatched subdomains, and so you have more flexibility if you ever need to do anything other than reverse_proxy for a service; otherwise you’d be contending with Caddy’s directive ordering.
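Roughly, a sketch based on your existing matchers (only two services shown, the rest follow the same shape, and the final bare handle catches any subdomain that isn’t matched):

*.asandhu.ca {
    import cloudflare

    @audiobookshelf host audiobookshelf.asandhu.ca
    handle @audiobookshelf {
        reverse_proxy audiobookshelf:80
    }

    @immich host immich.asandhu.ca
    handle @immich {
        reverse_proxy immich_server:2283
    }

    # ...repeat for jellyfin, nas, pihole, portainer...

    handle {
        abort
    }
}

With that structure, a request to an undefined subdomain gets cleanly rejected instead of silently receiving Caddy’s empty default response.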

Thanks for the tips on the Caddyfile. I agree it seems to be a Docker/other issue, but the randomness of it makes it really difficult to track down. Also, the fact that accessing containers by local IP works flawlessly while the domains don’t makes me believe it has something to do with the domain, or Caddy at some level… although then it doesn’t make sense that exec’ing into the container and pinging another would show the same issue. :confused:

It’s kinda boggling my mind a little bit that this issue cares about DNS?

Technically speaking, DNS resolution and the actual connection are two separate procedures: first we turn a hostname into an IP address with domain name resolution, THEN we connect to that IP address. Once we have the IP, DNS shouldn’t come into it at all.

So to hear you describe pinging another container by service name and losing all those packets, but connecting over IP without using DNS and getting solid, fast connections every time? My knowledge of the networking stack is not deep enough to even conceive of how that could be possible, let alone how to fix it. Each individual ICMP packet that ping sends carries no DNS-related information and should be indistinguishable from an ICMP packet where ping used DNS to resolve the IP first. So how could a network stack discriminate between those?
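If you want to put a number on it, one sanity check (my suggestion; the packet count is arbitrary, and the IP is audiobookshelf’s from the network inspect above) is to run both variants back to back:

# name-based: resolved once via Docker's embedded DNS before pinging
docker exec -it caddy ping -c 50 audiobookshelf
# IP-based: same target, DNS never involved
docker exec -it caddy ping -c 50 172.19.0.14

If both drop packets at the same rate, DNS is off the hook entirely.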

To be completely, brutally honest: I’m with Francis insofar as I would nuke the machine from orbit (cattle, not pets) and reimplement the Compose stack on a brand new or known-good machine. At a certain point it just becomes more efficient than trying to learn about something you didn’t know you didn’t know about, and which might never be an issue again.

Good point. Once the pings are going out, DNS shouldn’t even be involved; as is evident from the command output itself, it already knows the IP. It’s also totally possible that I’m just getting lucky when I hit it by IP instead of domain in the browser, and that’s leading me to think it’s behaving differently when it really isn’t.

I’m trying to treat nuking the machine as a last resort for now, but that does seem to be the way this is heading, unfortunately.

For posterity, with some help from the Docker community, I think the root cause for this has been identified:

As you can see from the network inspect I posted, audiobookshelf and cloudflare-tunnel share a MAC address (02:42:ac:13:00:02), and that collision is likely what causes this odd behaviour. I’m not clear yet on whether this is a Portainer-specific issue or Docker itself, but regardless, I should be able to work around it now that I at least know the cause. Thanks for all your help!
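For anyone who wants to check their own setup, a quick-and-dirty way to spot duplicate MACs on a network (just a sketch; adjust the network name) is:

# print any MacAddress line that appears more than once in the inspect output
docker network inspect caddy_network | grep MacAddress | sort | uniq -d

Any output at all means two endpoints are claiming the same MAC.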


Wow, that’s insane. Good to know. From my reading of that issue, it seems like it’s Portainer’s fault, but they don’t have the information necessary to make a smarter decision because Docker doesn’t have a way to provide it.