Docker long running processes timing out a lot

1. Caddy version (2.4.5):

2. How I run Caddy:

docker-compose.yml

a. System environment:

Docker

b. Command:

systemd : docker-compose up

c. Service/unit/compose file:

Paste full file contents here.
Make sure backticks stay on their own lines,
and the post looks nice in the preview pane.

d. My complete Caddyfile or JSON config:

{
    email   email@gmail.com
}


# REVERSE PROXY
# Values from config.env
# If this is localhost, use the port only.
# Only use PROXY_HOST if it is not localhost.
# No need to add PROXY_PORT if it is port 80.

{$PROXY_SERVER_NAME}, www.{$PROXY_SERVER_NAME} {
    reverse_proxy {$INTERNAL_SERVER_NAME} {
        header_up Host {$MAIN_UPSTREAM_HOST}
    }
}

{$SOCK_PROXY_NAME}, www.{$SOCK_PROXY_NAME} {
    reverse_proxy {$INTERNAL_SERVER_NAME} {
        header_up Host {$SOCK_UPSTREAM_HOST}
    }
}

#check for any docker image updates
http://wud.localhost {
    reverse_proxy wud:3000
}

3. The problem I’m having:

I am using a docker-compose file to run my docker containers.
Everything ran great until I started getting timeouts on long-running tasks.
My tasks can take anywhere from 30 seconds to 3 hours, and it includes downloading files, processing them, then finally uploading them.

I am getting the following error a lot from my main server:

2021/10/30 06:15:28 write tcp 172.18.0.2:8000->172.18.0.4:47336: write: broken pipe
write tcp 172.18.0.2:8000->172.18.0.4:47336: write: broken pipe
write tcp 172.18.0.2:8000->172.18.0.4:47336: write: broken pipe
write tcp 172.18.0.2:8000->172.18.0.4:47336: write: broken pipe

172.18.0.4 is my Caddy reverse-proxy on the host network.
172.18.0.2 is my main server on a bridge network.

Is there anything I can do on the docker-compose file, docker settings, or Caddyfile to prevent this from happening?
I am at my wit’s end.

On further checking syslog I can see the following:
Oct 30 06:15:26 vps-18000 docker-compose[964]: #033[36mcaddyserver |#033[0m {"level":"error","ts":1635596126.2547772,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"context canceled"}

4. Error messages and/or full log output:

Oct 30 06:15:26 vps-18000 docker-compose[964]: #033[36mcaddyserver |#033[0m {"level":"error","ts":1635596126.2547772,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"context canceled"}

5. What I already tried:

I have made sure that my actual server is not the cause. It is not outputting any programmatic errors apart from the above-mentioned.

6. Links to relevant resources:

1 Like

do you see anything announcing in dmesg -T at the physical host level that corresponds with your timeout events?

2 Likes

also docker version output, and if you could share your docker-compose.yml for reference (or reproduction if needed)

1 Like
Client: Docker Engine - Community
 Version:           20.10.10
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        b485636
 Built:             Mon Oct 25 07:43:17 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.10
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.9
  Git commit:       e2f740d
  Built:            Mon Oct 25 07:41:26 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
1 Like

dmesg -T won’t go that far back but I think I found that message at the end of syslog

Oct 30 06:15:40 vps-18000 kernel: [74921.840161] [UFW BLOCK] IN=eth0 OUT= MAC=fa:16:3e:37:fc:ec:ea:03:5b:eb:e4:1c:08:10 SRC=162.142.125.106 DST=51.222.30.145 LEN=44 TOS=0x00 PREC=0x00 TTL=39 ID=60598 PROTO=TCP SPT=27925 DPT=7922 WINDOW=1024 RES=0x00 SYN URGP=0 
1 Like

I realize it sucks to have to reproduce to get answers (and its a stretch) but the ask for the dmesg is to see if anything network physical or driver layer or below docker is emitting around the events. If the physical host is restarted the dmesg will start over (I think).

I’ll search around too, if you have another event it’ll be worth re-checking.

1 Like

https://www.linode.com/docs/guides/configure-firewall-with-ufw/

This is an interesting general discussion involving UFW block (obviously our goal is is NOT to block these messages, but it gives context I think).

Is this system internet facing? Are you perhaps starved of having enough resources (file descriptors) allocated to the container?

(I cant find docker documentation on it)

Interesting docker related project chatting about docker/UFW not playing well together (but really old discussion, might give more context to investigate tho)?

3 Likes

(Thank you for helping out @agrajag, as I have to admit I think this one’s over my head, ha.)

2 Likes

@Ed_Siror - sorry to be spamming you but I’m very interested in understanding what can go wrong here as we are about to migrate to a similar architecture.

Two further discussions here for deeper understanding that might be worth looking at to help in considerations around researching & troubleshooting things. I’m not sure if the UFW Block is a red herring, but as you described things are definitely hitting a broken pipe scenario.

And an article that this first one references as well for broken pipe context

My takeaway from this is (if UFW is truly the timeout event trigger) to test and try to reproduce with UFW disabled (this might not be possible for security reasons).

I don’t think that file descriptors would trigger the UFW block (we would see something else) but it still would be good to evaluate if there is file descriptor starvation taking place in your container as the discussions do point out some good troubleshooting scenarios (I dont know if SNAT is in use but its worth still reading through and the symptom of broken pipe scenarios come up in the discussions).

Another approach to troubleshooting would be to have a very limited tcpdump running at the physical host layer that is specifically filtering on the source and target container IP’s for HTTP(S) only and capturing enough of a profile of the connections as you attempt to recreate, and the use the capture to evaluate the network conversation up to the connection “timing out” or hitting a “broken pipe”. There is additional discussion on how docker networking is “like SNAT” but I dont want to flood with references without deeper understanding of the network layout between the two containers.

This discussion points out ways to instrument things to look at things from container to host. You could potentially refine the capture further but this should not be a massive amount of data to capture the conversation (and any resets or things taking place).

Or just focus on the “main (physical)” host looking at the trafic between the containers I’m not 100% on this but it might capture properly, I don’t have a docker based config in place to test, but from what I’ve read of its networking it should work.

Note that the dst port is most likely ephemeral and within a range randomly, so there is no point to using the same port as your broken pipe messaging is referencing.

sudo tcpdump -nn -s 64 -K -N -w troubleshoot.pcap -i any '(src host 172.18.0.22 and port 8000) and dst host 172.18.0.4'

if your not familiar with tcpdump, use ctrl-c to end the capture when you realized a broken pipe has taken place, obviously making note of when (and run the dmesg -T as well)

-i all is listen on all interfaces
-nn don’t resolve IP to host or port (reduces size of capture & eliminates delays in the capture)
-K dont verify checksums
-s 64 grab 64 bytes of capture (I don’t think the payload is significant here & helps keep the capture file small) 32 bytes could probably be used if captures are getting too big…
-w <filename> write capture to file (so you can look at the network conversation in wireshark)

2 Likes

Thanks for all the contributions. I have been busy reading all of this. I might have found two defer statements that have been working for the past 3 years. I will try to manually close those and see if it changes anything. This won’t be the first time a defer statement ruined my month.

2 Likes

@agrajag , thanks a lot. Some of your ideas helped me troubleshoot. Definitely, defer statements had something to do with it. The context canceled error is now gone. I only saw one instance of http2: stream closed error. Now I need to figure out what is closing the stream early and I should be done. I am guessing one of my Close() comes too early.

1 Like

This topic was automatically closed after 30 days. New replies are no longer allowed.