Docker long-running processes timing out a lot

@Ed_Siror - sorry to be spamming you, but I'm very interested in understanding what can go wrong here, as we are about to migrate to a similar architecture.

Here are two further discussions that might be worth looking at for a deeper understanding; they should help frame the research and troubleshooting. I'm not sure if the UFW block is a red herring, but from your description you are definitely hitting a broken pipe scenario.

And here is an article that the first discussion references, for broken pipe context.

My takeaway from this is that, if UFW is truly the timeout trigger, you should test and try to reproduce with UFW disabled (this might not be possible for security reasons).
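Even if you can't disable UFW, you can at least check whether its blocks line up in time with the broken pipe events. A quick check from the host (assuming UFW's default kernel logging):

sudo ufw status verbose
sudo dmesg -T | grep -i 'UFW BLOCK'

If the block timestamps match the broken pipe timestamps, that's a strong hint UFW really is the trigger.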

I don't think file descriptor starvation would trigger the UFW block (we would see something else), but it would still be good to evaluate whether it is taking place in your container, as the discussions point out some good troubleshooting scenarios. I don't know if SNAT is in use, but it's still worth reading through them, since broken pipe symptoms come up in those discussions.
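For a rough read on file descriptor usage, something like this should work from the host (<container_name> is a placeholder for your container's name or ID):

# count open file descriptors for the container's main process (PID 1 inside the container)
docker exec <container_name> sh -c 'ls /proc/1/fd | wc -l'
# compare against the per-process fd limit inside the container
docker exec <container_name> sh -c 'ulimit -n'

If the count is creeping toward the limit during long runs, starvation is worth chasing.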

Another approach to troubleshooting would be to run a very limited tcpdump at the physical host layer, filtering specifically on the source and target container IPs for HTTP(S) only, capturing enough of a profile of the connections while you attempt to reproduce, and then using the capture to evaluate the network conversation up to the connection "timing out" or hitting a "broken pipe". There is additional discussion of how Docker networking is "like SNAT", but I don't want to flood you with references without a deeper understanding of the network layout between the two containers.
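Before setting up the capture, you can confirm each container's IP from the host; the container names below are placeholders:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <client_container>
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <server_container>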

This discussion points out ways to instrument things from container to host. You could potentially refine the capture further, but it should not take a massive amount of data to capture the conversation (and any resets or other things taking place).
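One way to instrument from the container side without installing tcpdump in the image is to join the container's network namespace from the host. A rough sketch (<container_name> is a placeholder, and I'm assuming port 8000 is the service port from your logs):

# find the container's PID on the host, then run tcpdump inside its network namespace
PID=$(docker inspect -f '{{.State.Pid}}' <container_name>)
sudo nsenter -t "$PID" -n tcpdump -nn -s 64 -w container-side.pcap 'tcp port 8000'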

Or just focus on the "main (physical)" host and look at the traffic between the containers. I'm not 100% sure on this, but it should capture properly; I don't have a Docker-based config in place to test, but from what I've read of Docker's networking it should work.
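If you go this host-side route, you would capture on the bridge interface the two containers share; the name varies (docker0 for the default bridge, br-<id> for user-defined networks), so something like this to find it first (<bridge_interface> is a placeholder):

docker network ls
ip -br link | grep -E 'docker0|br-'
sudo tcpdump -nn -s 64 -w bridge.pcap -i <bridge_interface> 'host 172.18.0.22 and host 172.18.0.4'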

Note that the client-side port is most likely ephemeral, picked randomly from a range, so there is no point filtering on the exact port your broken pipe message references; filter on the known service port instead.
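You can see the ephemeral range the client side draws from with:

sysctl net.ipv4.ip_local_port_range

(typically 32768 through 60999 on modern Linux)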

sudo tcpdump -nn -s 64 -K -N -w troubleshoot.pcap -i any 'host 172.18.0.22 and host 172.18.0.4 and tcp port 8000'

If you're not familiar with tcpdump: use ctrl-c to end the capture once you realize a broken pipe has taken place, obviously making note of when it happened (and run dmesg -T as well).

-i any listen on all interfaces
-nn don't resolve IPs to hostnames or ports to service names (avoids lookups that can delay output)
-N don't print domain-name qualification of host names (redundant alongside -nn, but harmless)
-K don't verify checksums
-s 64 capture only the first 64 bytes of each packet (I don't think the payload is significant here, and it keeps the capture file small); 32 bytes could probably be used if captures are getting too big…
-w <filename> write the capture to a file (so you can look at the network conversation in Wireshark)
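Once you have the file, Wireshark is the easiest way to follow the full conversation, but you can also do a quick command-line pass for resets and FINs around the time of the broken pipe:

tcpdump -nn -r troubleshoot.pcap 'tcp[tcpflags] & (tcp-rst|tcp-fin) != 0'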
