Hangs on reload

Hello francislavoie :slight_smile:
Sadly, there is only one basicauth block. I have removed it and tested again, but the situation is the same.

Hello matt :slight_smile:
I have done what you said and got a profile, but I could not upload it here (only images are allowed, it seems?). I extracted a GIF with go tool pprof profile_file, and here it is:

[pprof call graph GIF]
If there is any way I can upload the profile file, I will upload it ASAP.

To add a bit of information about the profile:

I took a 30-minute profile (as shown in the GIF), but the reload finished within 20 minutes (maybe about 18 minutes). This time varies by situation :sob: I have seen that when many users are accessing the Caddy web server, it takes more than 30 minutes.

You could upload the profile to Pastebin or a GitHub gist, maybe. Anywhere is fine.

I'm not gleaning much from that image, but it looks like the proxy is slow to write, seemingly due to TLS encryption? Are you writing very big responses frequently? Does your machine have hardware acceleration for AES?

Either way I don't see how that relates to the reloads being slow.

I'm only now noticing that your systemd config is non-standard. We ship a recommended one with our Debian package, with additional details here if you want to set it up manually:

It might be good to switch to using caddy run to reduce the variables at play here (most users use caddy run)

Okay, here is the link to the profile: link

I've changed the systemd service file to the standard one, and also ran Caddy manually with caddy run, but… :sob: why… still no progress…

Possibly this is related? I do not know what more to do :pensive:

Thanks for the profile!

So these are the top CPU cycles:

File: caddy
Type: cpu
Time: Apr 8, 2021 at 10:16pm (MDT)
Duration: 30mins, Total samples = 24.43s ( 1.36%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 14.58s, 59.68% of 24.43s total
Dropped 831 nodes (cum <= 0.12s)
Showing top 10 nodes out of 273
      flat  flat%   sum%        cum   cum%
     3.93s 16.09% 16.09%      4.49s 18.38%  syscall.Syscall
     3.14s 12.85% 28.94%      3.14s 12.85%  crypto/aes.gcmAesEnc
     2.01s  8.23% 37.17%      2.01s  8.23%  golang.org/x/crypto/blowfish.encryptBlock
     2.01s  8.23% 45.40%      2.01s  8.23%  runtime.epollwait
     1.57s  6.43% 51.82%      1.57s  6.43%  runtime.futex
     0.58s  2.37% 54.20%      0.58s  2.37%  vendor/golang.org/x/crypto/curve25519.ladderstep
     0.42s  1.72% 55.91%      0.42s  1.72%  runtime.memmove
     0.40s  1.64% 57.55%      0.40s  1.64%  crypto/sha256.block
     0.30s  1.23% 58.78%      4.27s 17.48%  runtime.findrunnable
     0.22s   0.9% 59.68%      1.30s  5.32%  runtime.mallocgc

So over the 30-minute profile, it gathered 24s of CPU time, most of which was spent in system calls and AES/blowfish encryption. I also want to know the answer to Francis' questions - what kind of hardware are you running this on (specifically)?

Another significant chunk of time is in runtime.findrunnable, which appears to have a related bug here: runtime: scheduler is slow when goroutines are frequently woken · Issue #18237 · golang/go · GitHub – not sure if relevant to your situation.

It still doesn't account for the full 18-20 minutes. I'm not sure why yet.


Sorry for the late answer; there was a pressing personal matter to attend to :sweat_smile:
Anyway, the hardware Caddy is running on is a bare-metal on-premise server, an HP ProLiant DL320e Gen8, with an Intel Xeon(R) E3-1240 V2 CPU @ 3.40GHz (4 cores / 8 threads with Hyper-Threading) and 8GB of RAM.

I hope this is not related, but if it is, then… maybe I should find another way to deal with this; I was okay with nginx, so maybe I have to switch back…? :frowning:

No, if you switch now, all the troubleshooting effort will go to waste and I'll get mad. Let's figure this out.

Can I access the instance currently exhibiting the behavior?

OK, I won't try switching back.

Well, there are some firewall issues, so you may need to use a VPN. Can I contact you privately?

Maybe we could jump on a video chat? Not today – probably tomorrow or something – but I can troubleshoot it with you, with you driving. (I won't take over for liability reasons.)

I'd love that :slight_smile: Thanks for your concern, I deeply appreciate it. What times work for you? Since our timezones are different, I'd like to prepare if I can :slight_smile:

Looks like it does have AES-NI support, FWIW

It seems AES-NI support is already enabled on the system.

user@localhost:~$ sudo sort -u /proc/crypto | grep module
module       : aes_x86_64
module       : aesni_intel
module       : crc32_pclmul
module       : crct10dif_pclmul
module       : ghash_clmulni_intel
module       : kernel
module       : pcbc

user@localhost:~$ sudo cpuid | grep -i aes | sort | uniq
      AES instruction                         = true

I wonder if there's something weird with the VPN connection or something else about the network. Can you reproduce the behavior in any other environment?

As for the network, I do not have to use a VPN myself; it is a public IP address range with a firewall enabled (because of the special environment I am in, every device on this network is given a public IP address).

I cannot reproduce the same behavior in any other environment (I tested in an AWS VM, but it is fast there :frowning: )



After discussing on a call, we determined that this is probably because Caddy (actually, the Go standard library) is doing a graceful shutdown of the servers: http package - net/http - Go Packages

Shutdown gracefully shuts down the server without interrupting any active connections. Shutdown works by first closing all open listeners, then closing all idle connections, and then waiting indefinitely for connections to return to idle and then shut down. If the provided context expires before the shutdown is complete, Shutdown returns the contextā€™s error, otherwise it returns any error returned from closing the Serverā€™s underlying Listener(s).

By default, no grace period is set, so there is no timeout. If a client holds onto a connection for 20+ minutes (like if they are downloading a big file very slowly), the reload will wait that long for the connection to go idle. That is an ideal user experience, but maybe not so ideal for servers.
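
To illustrate the mechanism with a minimal sketch (plain net/http, not Caddy's actual shutdown code): Shutdown called with a plain background context waits indefinitely for connections to go idle, while a context with a timeout caps the wait, and that cap is effectively what a grace period is.

package main

import (
    "context"
    "log"
    "net/http"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8080"}

    go func() {
        // ErrServerClosed is the normal result after Shutdown.
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    time.Sleep(time.Second) // give the listener a moment to start (demo only)

    // srv.Shutdown(context.Background()) would wait forever for active
    // connections (e.g. a very slow download) to finish. Adding a deadline
    // caps the wait; that cap is the grace period.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    if err := srv.Shutdown(ctx); err != nil {
        // Grace period expired; force-close whatever is still open.
        log.Printf("graceful shutdown timed out: %v", err)
        srv.Close()
    }
}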

The grace period is configurable in JSON: JSON Config Structure - Caddy Documentation - and I bet setting it (say, 5 to 10 seconds, maybe 30 if you really want to) should cause reloads to finish closer to that amount of time.
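
If I recall the structure correctly, the grace_period field sits on the HTTP app and takes a duration string; something along these lines (the server block below is just a placeholder, not your actual config):

{
  "apps": {
    "http": {
      "grace_period": "10s",
      "servers": {
        "srv0": {
          "listen": [":443"],
          "routes": []
        }
      }
    }
  }
}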

It is not currently configurable in the Caddyfile, which would be a nice global option to have.


And here's a quick PR

:rofl:
